mySpellChecker has a pluggable POS tagging system with multiple backends — from zero-dependency rule-based inference to high-accuracy transformer models. POS tags drive grammar checking, disambiguation, and context-aware suggestions throughout the validation pipeline.

Introduction

What is POS Tagging?

Part-of-Speech tagging assigns grammatical categories (noun, verb, adjective, etc.) to words:
မြန်မာ → N (Noun)
ကောင်း → ADJ (Adjective)
သည် → P_SENT (Sentence-ending Particle)

Why Use POS Tagging?

  • Improved Accuracy: Context-aware spell checking (85-93% accuracy vs 70% without)
  • Better Suggestions: Grammatically appropriate correction suggestions
  • Disambiguation: Distinguish between homonyms based on context
  • Validation: Detect grammatical errors and inconsistencies

Integration Points

The POS tagger integrates at two levels:
  1. Build-Time: The inference engine assigns POS tags when building your dictionary from corpus
  2. Runtime: On-the-fly tagging for OOV (out-of-vocabulary) words during spell checking

Supported Tags

Two Tag Systems

mySpellChecker has two POS sources that produce different tag granularities:
  • Inference Engine (rule-based): Produces granular particle tags (P_SUBJ, P_OBJ, P_SENT, P_MOD, P_LOC) by analyzing suffixes and morphology. Used during dictionary building and for OOV word fallback.
  • Transformer Model (HuggingFace): Produces coarse particle tags (PPM, PART) because the underlying model was trained on a coarser tag set. Used for high-accuracy runtime tagging.
The transformer model does not distinguish between particle types — it outputs PPM (postpositional marker) or PART (general particle) for all particles. Granular particle tags (P_SUBJ, P_OBJ, etc.) come from the inference engine and the dictionary.
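When comparing output from the two backends, it can help to collapse granular tags to the coarse set. A minimal sketch, assuming a hypothetical `GRANULAR_TO_COARSE` helper (not part of the library), built from the mapping described above:

```python
# Hypothetical helper (not library API): collapse granular particle tags
# to the coarse transformer tag set so both backends can be compared.
GRANULAR_TO_COARSE = {
    "P_SUBJ": "PPM", "P_OBJ": "PPM", "P_SENT": "PPM",
    "P_MOD": "PPM", "P_LOC": "PPM",
    "P": "PART",
}

def to_coarse(tag: str) -> str:
    """Map a granular tag to its coarse equivalent; pass others through."""
    return GRANULAR_TO_COARSE.get(tag, tag)

print(to_coarse("P_SENT"))  # PPM
print(to_coarse("N"))       # N
```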

Complete POS Tag Set

Core Tags (all systems)

| Tag | Description | Examples | Source |
| --- | --- | --- | --- |
| N | Noun | မြန်မာ (Myanmar), နိုင်ငံ (country), လူ (person) | All |
| V | Verb | စား (eat), သွား (go), လုပ် (do), ရေး (write) | All |
| ADJ | Adjective | ကောင်း (good), လှ (beautiful), ကြီး (big) | All |
| ADV | Adverb | အလွန် (very), မြန်မြန် (quickly), ဖြည်းဖြည်း (slowly) | All |
| NUM | Number | တစ် (one), နှစ် (two), ၁၀ (10) | All |
| PRON | Pronoun | ကျွန်တော် (I), သူ (he/she), သူတို့ (they) | All |
| CONJ | Conjunction | နှင့် (and), သို့မဟုတ် (or), ဒါပေမယ့် (but) | All |
| INT | Interjection | ဟယ် (hey), အို (oh), ဟာ (wow) | All |
| UNK | Unknown | (none) | All |

Granular Particle Tags (inference engine only)

These tags are produced by the rule-based inference engine and stored in the dictionary. The transformer model cannot distinguish between particle types — it uses coarse tags instead.
| Tag | Description | Examples |
| --- | --- | --- |
| P_SUBJ | Subject/Topic Particle | က, ကား, ဟာ, မှာ |
| P_OBJ | Object Particle | ကို, အား |
| P_SENT | Sentence Ending | သည်, တယ်, မယ်, ပြီ, ပါ |
| P_MOD | Modifier Particle | သော, တဲ့, နဲ့, လို, ဖြင့် |
| P_LOC | Location/Direction | မှ (from), သို့ (to), ဆီ (towards), တွင် (in/at) |
| P | General Particle | လည်း (also), ပဲ (only), တော့ (as for) |

Coarse / Transformer-Only Tags

These tags come from the HuggingFace transformer model. They are broader categories that don’t distinguish particle subtypes.
| Tag | Description | Notes |
| --- | --- | --- |
| PPM | Postpositional Marker | Covers all particles (P_SUBJ, P_OBJ, P_SENT, P_MOD, P_LOC) |
| PART | General Particle | Catch-all for particles not classified as PPM |
| PUNCT | Punctuation | ။, ၊ |
| ABB | Abbreviation | Shortened forms |
| FW | Foreign Word | Non-Myanmar words |
| SB | Symbol | Special symbols |
| TN | Text Number | Numbers written in text form |

Transformer Tag Mapping

The HuggingFace model (chuuhtetnaing/myanmar-pos-model) outputs lowercase tags. The TransformerPOSTagger maps them to the internal uppercase convention via HF_TO_INTERNAL_TAG_MAP:
# HuggingFace → Internal mapping (pos_tagger_transformer.py)
HF_TO_INTERNAL_TAG_MAP = {
    "n": "N",       "v": "V",       "adj": "ADJ",
    "adv": "ADV",   "pron": "PRON", "num": "NUM",
    "conj": "CONJ", "int": "INT",   "punc": "PUNCT",
    "ppm": "PPM",   "part": "PART",
    "abb": "ABB",   "fw": "FW",     "sb": "SB",   "tn": "TN",
}

Tag Disambiguation Guidelines

Many Myanmar words can have multiple POS tags depending on context. Here are common ambiguities:

1. Noun vs. Verb Ambiguity

Some words function as both nouns and verbs:
| Word | As Noun | As Verb | Resolution |
| --- | --- | --- | --- |
| စာ | book/letter (N) | to read (V) | Check for preceding ကို or following particle |
| အလုပ် | work/job (N) | to work (V) | Check sentence structure |
| ပညာ | knowledge (N) | to educate (V) | Rare as verb; default to N |
Resolution rule: If followed by ကို/အား, it’s likely a noun. If followed by တယ်/ပြီ, it’s a verb.
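The resolution rule can be sketched as a simple look-ahead check. This is a hypothetical illustration, not library code; the particle sets come from the tables on this page:

```python
# Hedged sketch of the noun/verb resolution rule: decide based on the
# next token. Particle sets are taken from the tag tables on this page.
OBJ_PARTICLES = {"ကို", "အား"}    # object markers → preceding word is a noun
SENT_PARTICLES = {"တယ်", "ပြီ"}   # sentence-final particles → preceding word is a verb

def resolve_noun_verb(next_word: str, default: str = "N") -> str:
    if next_word in OBJ_PARTICLES:
        return "N"
    if next_word in SENT_PARTICLES:
        return "V"
    return default

print(resolve_noun_verb("ကို"))   # N
print(resolve_noun_verb("တယ်"))  # V
```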

2. Adjective vs. Verb Ambiguity

Myanmar adjectives often function as stative verbs:
| Word | Context | Tag | Example |
| --- | --- | --- | --- |
| ကောင်း | standalone predicate | V | အဲဒါ ကောင်းတယ် (That is good) |
| ကောင်း | modifier before noun | ADJ | ကောင်းတဲ့ လူ (good person) |
| လှ | sentence-final | V | သူ လှတယ် (She is beautiful) |
| လှ | with သော/တဲ့ | ADJ | လှသော မိန်းကလေး (beautiful girl) |
Resolution rule: With modifier particle (သော, တဲ့) → ADJ. As predicate → V.
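The adjective/verb rule reduces to a single membership test. A hypothetical sketch (not library code), using the modifier particles listed above:

```python
# Hedged sketch of the adjective/verb rule: a modifier particle after the
# word signals attributive use (ADJ); otherwise treat it as a predicate (V).
MOD_PARTICLES = {"သော", "တဲ့"}

def resolve_adj_verb(next_word: str) -> str:
    return "ADJ" if next_word in MOD_PARTICLES else "V"

print(resolve_adj_verb("သော"))  # ADJ
print(resolve_adj_verb("တယ်"))  # V
```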

3. Particle Disambiguation

Particles require careful context analysis:
| Word | Possible Tags | Context | Example |
| --- | --- | --- | --- |
| က | P_SUBJ | After subject noun | သူက သွားတယ် |
| က | P_LOC | With location meaning | ရန်ကုန်က လာတယ် (came from Yangon) |
| မှာ | P_SUBJ | Topic marker | ဒါမှာ ကောင်းတယ် |
| မှာ | P_LOC | Location marker | စားပွဲမှာ ရှိတယ် (is on the table) |
| မှာ | V | "to order" meaning | ထမင်းမှာမယ် (will order rice) |

Annotation Guidelines

When annotating Myanmar text for POS tagging:
  1. Segment first: Ensure proper word boundaries before tagging
  2. Context matters: Always consider surrounding words for disambiguation
  3. Particle chains: Tag each particle in a chain separately
    • Example: သွားပါမယ် = V(သွား) + P_SENT(ပါ) + P_SENT(မယ်)
  4. Compound words: Tag as single unit if dictionary entry exists
    • Example: ကျောင်းသား (student) = N (not N + N)
  5. Numbers: Use NUM for digits and number words
  6. Punctuation: Exclude from POS tagging (handled separately)
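Guideline 3 (particle chains) can be sketched as a toy particle-peeling loop. This is illustrative only: the particle list is tiny, and real segmentation is dictionary-driven:

```python
# Toy sketch of guideline 3: peel known sentence-final particles off the
# end of a chunk so each can be tagged separately. Illustrative list only.
SENT_PARTICLES = ["ပါ", "မယ်", "တယ်", "ပြီ"]

def split_particle_chain(chunk: str):
    """Return (stem, [particles]) by repeatedly stripping known suffixes."""
    particles = []
    changed = True
    while changed:
        changed = False
        for p in SENT_PARTICLES:
            # Keep at least one character as the stem.
            if chunk.endswith(p) and len(chunk) > len(p):
                particles.insert(0, p)
                chunk = chunk[: -len(p)]
                changed = True
                break
    return chunk, particles

print(split_particle_chain("သွားပါမယ်"))  # ('သွား', ['ပါ', 'မယ်'])
```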

Common Annotation Errors to Avoid

| Error | Incorrect | Correct | Reason |
| --- | --- | --- | --- |
| Particle as noun | သည် (N) | သည် (P_SENT) | Sentence-final particles aren’t nouns |
| Missing particle | ကောင်းတယ် (V) | ကောင်း (V) + တယ် (P_SENT) | Segment particles separately |
| Verb as adjective | ကောင်း (ADJ) in predicate | ကောင်း (V) | Predicative = verb |
| Wrong particle type | က (P) | က (P_SUBJ) | Use specific particle tags when available |

Quick Start

Default Configuration (Rule-Based)

No setup is required; the rule-based tagger works out of the box with zero dependencies:
from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider

# Uses default rule-based tagger
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)
result = checker.check("မြန်မာ နိုင်ငံ")

Upgrading to Transformer (High Accuracy)

Install the transformers package, then configure:
# Install with transformer support
pip install myspellchecker[transformers]
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    pos_tagger=POSTaggerConfig(
        tagger_type="transformer",
        device=0,  # GPU (use -1 for CPU)
    )
)

provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

Using Custom Models

Point to your fine-tuned HuggingFace model:
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig

config = SpellCheckerConfig(
    pos_tagger=POSTaggerConfig(
        tagger_type="transformer",
        model_name="your-username/your-myanmar-pos-model",
        device=-1,  # CPU
    )
)

Tagger Types

1. Rule-Based (Default)

Best for: Quick setup, no dependencies, production environments with tight resource constraints
Characteristics:
  • Fast suffix-based morphological analysis
  • Produces granular particle tags (P_SUBJ, P_OBJ, P_SENT, P_MOD, P_LOC)
  • No external dependencies
  • Fork-safe for multiprocessing
  • Lowest memory footprint
Performance:
  • Speed: Very Fast
  • Accuracy: ~70%
  • Memory: Very Low
  • Dependencies: None
How it works:
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger

tagger = RuleBasedPOSTagger(
    use_morphology_fallback=True,
    cache_size=10000,
    unknown_tag="UNK"
)

tag = tagger.tag_word("စားပြီ")  # Returns: P_SENT
Fallback chain:
  1. Check pos_map (if provided)
  2. Morphological suffix analysis
  3. Return “UNK” for unknown words

2. Transformer (Highest Accuracy)

Best for: Maximum accuracy, when GPU is available, offline processing
Characteristics:
  • Pre-trained neural models from HuggingFace
  • Context-aware sequence tagging
  • Produces coarse particle tags (PPM, PART) — mapped from HF lowercase tags
  • Requires GPU for optimal speed
  • Not fork-safe (CUDA limitations)
Performance:
  • Speed: Slow (CPU), Fast (GPU)
  • Accuracy: ~93%
  • Memory: ~500 MB (model) + ~100 MB (buffer)
  • Dependencies: transformers>=4.30.0, torch>=2.0.0
Default model: chuuhtetnaing/myanmar-pos-model (XLM-RoBERTa-based, 93.37% accuracy)
How it works:
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger

tagger = TransformerPOSTagger(
    model_name="chuuhtetnaing/myanmar-pos-model",
    device=0,  # GPU
    batch_size=32,
    max_length=128
)

# Single word
tag = tagger.tag_word("မြန်မာ")  # Returns: N

# Sequence (context-aware)
tags = tagger.tag_sequence(["မြန်မာ", "နိုင်ငံ", "သည်"])
# Returns: ['N', 'N', 'PPM']  (coarse particle tag)

# With confidence scores
prediction = tagger.tag_word_with_confidence("ကောင်း")
print(f"{prediction.tag} (confidence: {prediction.confidence:.2f})")
# Output: ADJ (confidence: 0.95)

3. Viterbi HMM

Best for: Context-aware tagging without GPU, balanced accuracy/speed
Characteristics:
  • Hidden Markov Model with Viterbi algorithm
  • Uses trigram transition probabilities
  • Requires pre-built probability tables
  • Fork-safe
Performance:
  • Speed: Fast
  • Accuracy: ~85% (with probability tables), ~70% (fallback to morphology)
  • Memory: ~50 MB (probability tables)
  • Dependencies: None (pure Python + optional Cython)
Database Requirements: The Viterbi tagger requires POS probability tables in the database:
| Table | Description |
| --- | --- |
| pos_unigrams | P(tag) - prior tag probabilities |
| pos_bigrams | P(tag2 \| tag1) - tag transition probabilities |
| pos_trigrams | P(tag3 \| tag1, tag2) - trigram context |
Building database with POS probabilities:
# Build with POS tagging enabled (populates probability tables)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer

# Or use sample database (includes pre-computed probabilities)
myspellchecker build --sample
Note: If probability tables are empty, Viterbi falls back to morphological analysis with reduced accuracy (~70%).
How it works:
from myspellchecker.algorithms.pos_tagger_factory import POSTaggerFactory
from myspellchecker.providers import SQLiteProvider

# Requires provider with POS probability tables
provider = SQLiteProvider("mydict.db")

tagger = POSTaggerFactory.create("viterbi", provider=provider, beam_width=10)

# Context-aware sequence tagging
tags = tagger.tag_sequence(["မြန်မာ", "နိုင်ငံ", "သည်"])

4. Custom Tagger

Best for: Domain-specific requirements, research experiments
Implement your own tagger by inheriting from POSTaggerBase:
from myspellchecker.algorithms.pos_tagger_base import POSTaggerBase, TaggerType

class MyCustomTagger(POSTaggerBase):
    def tag_word(self, word: str) -> str:
        # Your logic here
        return "N"

    def tag_sequence(self, words: list[str]) -> list[str]:
        # Your logic here
        return ["N"] * len(words)

    @property
    def tagger_type(self) -> TaggerType:
        return TaggerType.CUSTOM

# Use via factory
from myspellchecker.algorithms.pos_tagger_factory import POSTaggerFactory
from myspellchecker.core.config import POSTaggerConfig

tagger = POSTaggerFactory.create("custom", provider=provider)

Configuration

POSTaggerConfig

Central configuration for POS tagger system:
from myspellchecker.core.config import POSTaggerConfig

config = POSTaggerConfig(
    # Tagger selection
    tagger_type="transformer",  # "rule_based" | "transformer" | "viterbi"

    # Transformer settings
    model_name="chuuhtetnaing/myanmar-pos-model",
    device=-1,  # -1=CPU, 0+=GPU index
    batch_size=32,
    cache_dir=None,  # Model cache directory

    # Rule-based settings
    cache_size=10000,
    unknown_tag="UNK",
    use_morphology_fallback=True,

    # Viterbi settings
    beam_width=10,
    emission_weight=1.2,
    min_prob=1e-10,
)

Environment Variables

Configure via environment variables (useful for deployment):
# Tagger type
export MYSPELL_POS_TAGGER_TYPE="transformer"

# Model selection
export MYSPELL_POS_TAGGER_MODEL_NAME="your-username/model"

# Beam width for Viterbi tagger
export MYSPELL_POS_TAGGER_BEAM_WIDTH="15"

Configuration Priority

  1. Explicit config in code (highest priority)
  2. Environment variables
  3. Default values (lowest priority)

Build-Time Usage

CLI - Building Dictionaries

Default (Rule-Based)

myspellchecker build \
  -i corpus.txt \
  -o mydict.db \
  --sample=false

With Transformer Tagger

myspellchecker build \
  -i corpus.txt \
  -o mydict.db \
  --pos-tagger transformer \
  --pos-model chuuhtetnaing/myanmar-pos-model \
  --pos-device 0 \
  --sample=false

With Custom Model

myspellchecker build \
  -i corpus.txt \
  -o mydict.db \
  --pos-tagger transformer \
  --pos-model /path/to/my/finetuned/model \
  --pos-device -1 \
  --sample=false

Python API - Building Dictionaries

from myspellchecker.data_pipeline.pipeline import Pipeline
from myspellchecker.data_pipeline.config import PipelineConfig
from myspellchecker.core.config import POSTaggerConfig

# Configure pipeline with POS tagger
config = PipelineConfig(
    pos_tagger=POSTaggerConfig(
        tagger_type="transformer",
        model_name="chuuhtetnaing/myanmar-pos-model",
        device=0,  # GPU
        batch_size=64,  # Larger batch for build-time
    ),
    keep_intermediate=False,
)

# Build database
pipeline = Pipeline(config=config, work_dir="temp_build")
pipeline.build_database(
    input_files=["corpus1.txt", "corpus2.txt"],
    database_path="mydict.db",
    sample=False,
)

Runtime Usage

SpellChecker Configuration

Default (Rule-Based)

from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider

# Explicit provider (required — no bundled database)
provider = SQLiteProvider(database_path="mydict.db")
checker = SpellChecker(provider=provider)

With Transformer

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig
from myspellchecker.providers import SQLiteProvider
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger

# Create transformer tagger
tagger = TransformerPOSTagger(
    model_name="chuuhtetnaing/myanmar-pos-model",
    device=0,  # GPU
)

# Create provider with tagger
provider = SQLiteProvider(database_path="mydict.db", pos_tagger=tagger)

# Create config and spell checker
config = SpellCheckerConfig(
    pos_tagger=POSTaggerConfig(tagger_type="transformer", device=0)
)
checker = SpellChecker(config=config, provider=provider)

# Use spell checker
result = checker.check("မြန်မာ နိုင်ငံ ကောင်း သည်")

OOV Word Handling

The POS tagger provides fallback for out-of-vocabulary words:
from myspellchecker.providers import SQLiteProvider
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger

provider = SQLiteProvider(database_path="mydict.db")

# For known word: database lookup
pos = provider.get_word_pos("မြန်မာ")  # Returns: N (from database)

# For OOV word: tagger fallback
pos = provider.get_word_pos("CompletelyUnknownWord123")  # Returns: UNK
Fallback chain:
  1. Database lookup
  2. Stemming + root lookup
  3. POS tagger
  4. Morphology analyzer (backward compatibility)
  5. Return None or “UNK”

Performance Comparison

Comparison

| Tagger | Speed | Accuracy | Memory | Context-Aware |
| --- | --- | --- | --- | --- |
| Rule-Based | Very Fast | ~70% | Very Low | No |
| Viterbi | Fast | ~85% | Low | Yes |
| Transformer | Slow (CPU) / Fast (GPU) | ~93% | High | Yes |

Recommendation Matrix

| Use Case | Recommended Tagger | Reason |
| --- | --- | --- |
| Production API | Rule-Based or Viterbi | Fast, low memory, no GPU needed |
| Batch Processing | Transformer (GPU) | Highest accuracy, GPU parallelism |
| Offline Analysis | Transformer (CPU) | Accuracy over speed |
| Embedded Systems | Rule-Based | Minimal footprint |
| Research | Transformer or Custom | Flexibility and accuracy |

Troubleshooting

Missing Dependencies

Error: ImportError: transformers required
Solution:
pip install myspellchecker[transformers]
# Or manually:
pip install transformers>=4.30.0 torch>=2.0.0
Verification:
try:
    from transformers import pipeline
    print("Transformers installed")
except ImportError:
    print("Transformers not installed")

CUDA Errors

Error: RuntimeError: CUDA out of memory
Solutions:
  1. Reduce batch size:
config = POSTaggerConfig(
    tagger_type="transformer",
    batch_size=8,  # Reduce from default 32
)
  2. Use CPU:
config = POSTaggerConfig(
    tagger_type="transformer",
    device=-1,  # Force CPU
)
  3. Clear GPU cache:
import torch
torch.cuda.empty_cache()
Error: RuntimeError: CUDA error: device-side assert triggered
Solution: Usually a model/data mismatch. Verify:
tagger = TransformerPOSTagger(device=0)
# Ensure input is valid Myanmar Unicode text
tag = tagger.tag_word("မြန်မာ")  # Valid
# tag = tagger.tag_word(None)  # Invalid - will crash

Model Loading Failures

Error: OSError: Can't load model from 'nonexistent/model'
Solutions:
  1. Verify model exists:
# Check HuggingFace model
curl -I https://huggingface.co/chuuhtetnaing/myanmar-pos-model

# Or use local path
ls /path/to/my/model/config.json
  2. Check internet connection (for HuggingFace downloads):
import requests
response = requests.get("https://huggingface.co")
print(f"Status: {response.status_code}")
  3. Use cache directory:
config = POSTaggerConfig(
    tagger_type="transformer",
    cache_dir="/path/to/cache",  # Persistent cache
)
  4. Download manually:
# Download model to local directory
huggingface-cli download chuuhtetnaing/myanmar-pos-model --local-dir ./my-model

# Use local path
python -c "
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
tagger = TransformerPOSTagger(model_name='./my-model')
"

Fork-Safety Issues

Error: RuntimeError: Cannot re-initialize CUDA in forked subprocess
Cause: Transformer models use CUDA, which is not fork-safe.
Solution: Use the rule-based or Viterbi tagger for multiprocessing:
from multiprocessing import Pool
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger

# Fork-safe
tagger = RuleBasedPOSTagger()

def process_batch(words):
    return [tagger.tag_word(w) for w in words]

with Pool(4) as pool:
    results = pool.map(process_batch, batches)

# NOT fork-safe
# tagger = TransformerPOSTagger()  # Will crash in forked processes
Alternative: Use spawn instead of fork:
from multiprocessing import get_context

with get_context("spawn").Pool(4) as pool:
    results = pool.map(process_batch, batches)

Performance Issues

Slow tagging with transformer:
  1. Use GPU:
config = POSTaggerConfig(device=0)  # GPU 0
  2. Increase batch size:
config = POSTaggerConfig(batch_size=64)  # Default: 32
  3. Use quantization (trade accuracy for speed):
# Requires torch>=2.0
from transformers import AutoModelForTokenClassification
import torch

model = AutoModelForTokenClassification.from_pretrained(
    "chuuhtetnaing/myanmar-pos-model"
)
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

Joint Segmentation and Tagging

Overview

Joint segmentation and tagging is an advanced mode that performs word segmentation and POS tagging simultaneously in a single Viterbi pass. This differs from the default sequential approach, where text is first segmented and then tagged.
Default Behavior (Sequential Mode):
Input Text -> Segmenter -> Words -> POS Tagger -> Tagged Words
Joint Mode:
Input Text -> Joint Viterbi Decoder -> Words + Tags (simultaneously)

Why It’s Disabled by Default

Joint mode is disabled by default (config.joint.enabled=False) for several important reasons:
| Reason | Explanation |
| --- | --- |
| Increased Complexity | State space is O(positions x word_lengths x tags^2) vs O(words x tags^2) for sequential |
| Higher Memory Usage | Beam search over the joint state space requires more memory |
| Less Tested | The sequential pipeline has more extensive production testing |
| Similar Accuracy | For most use cases, sequential mode achieves comparable results |
| Startup Overhead | Joint mode requires loading additional probability tables |

When to Enable Joint Mode

Joint mode may provide benefits in specific scenarios:
| Use Case | Benefit | Enable Joint? |
| --- | --- | --- |
| Ambiguous segmentation | POS context helps resolve word boundaries | Yes |
| OOV-heavy text | Joint optimization handles unknown words better | Yes |
| Research/Experiments | Comparing segmentation approaches | Yes |
| Production API | Latency-sensitive, well-segmented text | No |
| Simple validation | Basic spell checking | No |

Configuration

Enable Joint Mode

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, JointConfig
from myspellchecker.providers import SQLiteProvider

# Enable joint segmentation-tagging
config = SpellCheckerConfig(
    joint=JointConfig(
        enabled=True,
        beam_width=15,  # Larger beam for joint state space
        max_word_length=20,
        emission_weight=1.2,
        word_score_weight=1.0,
    )
)

provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

Using SpellCheckerBuilder

from myspellchecker.core.builder import SpellCheckerBuilder

checker = (
    SpellCheckerBuilder()
    .with_joint_segmentation(enabled=True)
    .build()
)

# Perform joint segmentation and tagging
words, tags = checker.segment_and_tag("မြန်မာနိုင်ငံ")
print(list(zip(words, tags)))
# Output: [('မြန်မာ', 'N'), ('နိုင်ငံ', 'N')]

JointConfig Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | False | Enable joint segmentation-tagging mode |
| beam_width | int | 15 | Beam width for Viterbi decoding (larger = more accurate, slower) |
| max_word_length | int | 20 | Maximum word length in characters |
| emission_weight | float | 1.2 | Weight for P(tag \| word) emission probabilities |
| word_score_weight | float | 1.0 | Weight for word n-gram scores |
| min_prob | float | 1e-10 | Minimum probability threshold to prevent underflow |
| use_morphology_fallback | bool | True | Use morphology analyzer for OOV word tagging |

Performance Comparison

| Mode | Speed | Memory | Best For |
| --- | --- | --- | --- |
| Sequential | Fast | Low | Production, latency-sensitive |
| Joint | Moderate | Higher | Ambiguous text, research |
Note: Performance varies based on text complexity and hardware.

Usage Example

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, JointConfig
from myspellchecker.providers import SQLiteProvider

# Sequential mode (default)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
sequential_checker = SpellChecker(provider=provider)
words_seq, tags_seq = sequential_checker.segment_and_tag("မြန်မာနိုင်ငံသည်ကောင်းသည်")

# Joint mode
joint_config = SpellCheckerConfig(
    joint=JointConfig(enabled=True)
)
joint_checker = SpellChecker(config=joint_config, provider=provider)
words_joint, tags_joint = joint_checker.segment_and_tag("မြန်မာနိုင်ငံသည်ကောင်းသည်")

# Compare results
print(f"Sequential: {list(zip(words_seq, tags_seq))}")
print(f"Joint: {list(zip(words_joint, tags_joint))}")

Technical Details

The joint decoder uses a unified Viterbi algorithm that optimizes:
argmax_{words, tags} P(words, tags | text)
  = argmax Π_i P(word_i | word_{i-1}) x P(tag_i | tag_{i-1}, tag_{i-2}) x P(tag_i | word_i)
State representation: (position, word_start, current_tag, prev_tag)
Scoring components:
  1. Word score: log P(word | prev_word) - N-gram language model
  2. Transition score: log P(tag | prev_tags) - POS tag sequence model
  3. Emission score: log P(tag | word) - Word-to-tag emission probability
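In log space the three components add. A toy sketch of scoring one candidate, reusing the weight names from JointConfig; all probability values are invented for illustration:

```python
import math

# Toy sketch: score one (word, tag) candidate by summing the three
# log-space components listed above. Probabilities are made-up numbers;
# weight names and defaults mirror JointConfig.
def joint_score(p_word: float, p_transition: float, p_emission: float,
                emission_weight: float = 1.2, word_score_weight: float = 1.0,
                min_prob: float = 1e-10) -> float:
    def safe_log(p: float) -> float:
        return math.log(max(p, min_prob))  # min_prob prevents log(0) underflow
    return (word_score_weight * safe_log(p_word)       # word n-gram score
            + safe_log(p_transition)                   # tag transition score
            + emission_weight * safe_log(p_emission))  # emission score

print(joint_score(p_word=0.01, p_transition=0.3, p_emission=0.8))
```

A Viterbi decoder would compute this score for every candidate (word boundary, tag) pair at each position and keep the best `beam_width` states.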

Limitations

  1. Requires probability tables: Joint mode needs bigram/trigram probabilities in the database
  2. Not all segmenters support it: Only JointSegmentTagger implements joint mode
  3. Base segmenters raise NotImplementedError: Individual segmenters don’t support joint mode; use SpellChecker.segment_and_tag() instead

Advanced Topics

Fine-Tuning Custom Models

Train your own Myanmar POS tagger on domain-specific data:
# 1. Prepare training data (word, POS tag pairs)
training_data = [
    ("မြန်မာ", "N"),
    ("နိုင်ငံ", "N"),
    ("ကောင်း", "ADJ"),
    # ... more examples
]

# 2. Use HuggingFace Trainer (example)
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)

# Label set for the classifier head (example: core + coarse tags from this page)
pos_tags = ["N", "V", "ADJ", "ADV", "NUM", "PRON", "CONJ", "INT",
            "PPM", "PART", "PUNCT", "ABB", "FW", "SB", "TN"]

model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(pos_tags)
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# 3. Train (simplified)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./my-myanmar-pos"),
    train_dataset=train_dataset,  # tokenized dataset built from training_data (construction omitted)
)
trainer.train()

# 4. Use your model
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
tagger = TransformerPOSTagger(model_name="./my-myanmar-pos")

Extending with Custom Taggers

Create domain-specific taggers:
from myspellchecker.algorithms.pos_tagger_base import POSTaggerBase, POSPrediction, TaggerType

class DomainSpecificTagger(POSTaggerBase):
    """Medical domain POS tagger."""

    def __init__(self, medical_terms_dict):
        self.medical_terms = medical_terms_dict

    def tag_word(self, word: str) -> str:
        # Check medical terminology first
        if word in self.medical_terms:
            return self.medical_terms[word]

        # Fallback to heuristics
        if word.endswith("ရောဂါ"):
            return "N_DISEASE"

        return "UNK"

    def tag_sequence(self, words: list[str]) -> list[str]:
        return [self.tag_word(w) for w in words]

    @property
    def tagger_type(self) -> TaggerType:
        return TaggerType.CUSTOM

# Usage
medical_dict = {
    "ငှက်ဖျားရောဂါ": "N_DISEASE",
    "ဆေးဝါး": "N_MEDICINE",
}
tagger = DomainSpecificTagger(medical_dict)

Combining Multiple Taggers

Ensemble approach for higher accuracy:
class EnsembleTagger(POSTaggerBase):
    def __init__(self, taggers: list[POSTaggerBase], weights: list[float]):
        self.taggers = taggers
        self.weights = weights

    def tag_word_with_confidence(self, word: str) -> POSPrediction:
        predictions = [
            t.tag_word_with_confidence(word) for t in self.taggers
        ]

        # Weighted voting
        votes = {}
        for pred, weight in zip(predictions, self.weights):
            votes[pred.tag] = votes.get(pred.tag, 0) + weight * pred.confidence

        best_tag = max(votes, key=votes.get)
        confidence = votes[best_tag] / sum(self.weights)

        return POSPrediction(word=word, tag=best_tag, confidence=confidence)

    def tag_word(self, word: str) -> str:
        return self.tag_word_with_confidence(word).tag

    def tag_sequence(self, words: list[str]) -> list[str]:
        return [self.tag_word(w) for w in words]

# Usage
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger

ensemble = EnsembleTagger(
    taggers=[
        RuleBasedPOSTagger(),
        TransformerPOSTagger(),
    ],
    weights=[0.3, 0.7]  # Trust transformer more
)

Caching Strategies

Optimize performance with intelligent caching:
from functools import lru_cache
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger

class CachedTransformerTagger(TransformerPOSTagger):
    def __init__(self, *args, cache_size=10000, **kwargs):
        super().__init__(*args, **kwargs)
        self._setup_cache(cache_size)

    def _setup_cache(self, cache_size):
        self._tag_word_cached = lru_cache(maxsize=cache_size)(
            super().tag_word
        )

    def tag_word(self, word: str) -> str:
        return self._tag_word_cached(word)

# Usage - 10x speedup for repeated words
tagger = CachedTransformerTagger(cache_size=50000)

Acknowledgments

Transformer POS Model

The default transformer-based POS tagger uses the myanmar-pos-model by Chuu Htet Naing:
| Attribute | Value |
| --- | --- |
| Model | chuuhtetnaing/myanmar-pos-model |
| Author | Chuu Htet Naing |
| Base Model | XLM-RoBERTa |
| Accuracy | 93.37% |
| F1 Score | 92.24% |
| License | Please refer to the model’s Hugging Face page for license information |
This model was trained specifically for Myanmar/Burmese part-of-speech tagging and provides state-of-the-art accuracy for the language.
Citation: If you use the transformer POS tagger in your research, please cite the original model:
@misc{chuuhtetnaing-myanmar-pos,
  author = {Chuu Htet Naing},
  title = {Myanmar POS Model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/chuuhtetnaing/myanmar-pos-model}
}
We express our gratitude to Chuu Htet Naing for making this model publicly available, which significantly enhances the accuracy of Myanmar language processing in mySpellChecker.

See Also