mySpellChecker has a pluggable POS tagging system with multiple backends — from zero-dependency rule-based inference to high-accuracy transformer models. POS tags drive grammar checking, disambiguation, and context-aware suggestions throughout the validation pipeline.
Introduction
What is POS Tagging?
Part-of-Speech tagging assigns grammatical categories (noun, verb, adjective, etc.) to words:
မြန်မာ → N (Noun)
ကောင်း → ADJ (Adjective)
သည် → P_SENT (Sentence-ending Particle)
Why Use POS Tagging?
- Improved Accuracy: Context-aware spell checking (85-93% accuracy vs 70% without)
- Better Suggestions: Grammatically appropriate correction suggestions
- Disambiguation: Distinguish between homonyms based on context
- Validation: Detect grammatical errors and inconsistencies
Integration Points
The POS tagger integrates at two levels:
- Build-Time: The inference engine assigns POS tags when building your dictionary from a corpus
- Runtime: On-the-fly tagging for OOV (out-of-vocabulary) words during spell checking
Two Tag Systems
mySpellChecker has two POS sources that produce different tag granularities:
- Inference Engine (rule-based): Produces granular particle tags (P_SUBJ, P_OBJ, P_SENT, P_MOD, P_LOC) by analyzing suffixes and morphology. Used during dictionary building and for OOV word fallback.
- Transformer Model (HuggingFace): Produces coarse particle tags (PPM, PART) because the underlying model was trained on a coarser tag set. Used for high-accuracy runtime tagging.
The transformer model does not distinguish between particle types — it outputs PPM (postpositional marker) or PART (general particle) for all particles. Granular particle tags (P_SUBJ, P_OBJ, etc.) come from the inference engine and the dictionary.
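To make the relationship concrete, a granular tag can be collapsed to its coarse transformer equivalent with a simple lookup (the mapping and `coarsen` helper below are illustrative, not part of the library API):

```python
# Illustrative mapping from granular (inference-engine) particle tags to the
# coarse tags the transformer model emits. Not a library API.
GRANULAR_TO_COARSE = {
    "P_SUBJ": "PPM",
    "P_OBJ": "PPM",
    "P_SENT": "PPM",
    "P_MOD": "PPM",
    "P_LOC": "PPM",
    "P": "PART",
}

def coarsen(tag: str) -> str:
    """Collapse a granular particle tag to its coarse equivalent; pass others through."""
    return GRANULAR_TO_COARSE.get(tag, tag)

print(coarsen("P_SUBJ"))  # PPM
print(coarsen("N"))       # N (non-particle tags are unchanged)
```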
Complete POS Tag Set
| Tag | Description | Examples | Source |
|---|---|---|---|
| N | Noun | မြန်မာ (Myanmar), နိုင်ငံ (country), လူ (person) | All |
| V | Verb | စား (eat), သွား (go), လုပ် (do), ရေး (write) | All |
| ADJ | Adjective | ကောင်း (good), လှ (beautiful), ကြီး (big) | All |
| ADV | Adverb | အလွန် (very), မြန်မြန် (quickly), ဖြည်းဖြည်း (slowly) | All |
| NUM | Number | တစ် (one), နှစ် (two), ၁၀ (10) | All |
| PRON | Pronoun | ကျွန်တော် (I), သူ (he/she), သူတို့ (they) | All |
| CONJ | Conjunction | နှင့် (and), သို့မဟုတ် (or), ဒါပေမယ့် (but) | All |
| INT | Interjection | ဟယ် (hey), အို (oh), ဟာ (wow) | All |
| UNK | Unknown | — | All |
Granular Particle Tags (inference engine only)
These tags are produced by the rule-based inference engine and stored in the dictionary. The transformer model cannot distinguish between particle types — it uses coarse tags instead.
| Tag | Description | Examples |
|---|---|---|
| P_SUBJ | Subject/Topic Particle | က, ကား, ဟာ, မှာ |
| P_OBJ | Object Particle | ကို, အား |
| P_SENT | Sentence Ending | သည်, တယ်, မယ်, ပြီ, ပါ |
| P_MOD | Modifier Particle | သော, တဲ့, နဲ့, လို, ဖြင့် |
| P_LOC | Location/Direction | မှ (from), သို့ (to), ဆီ (towards), တွင် (in/at) |
| P | General Particle | လည်း (also), ပဲ (only), တော့ (as for) |
Coarse Tags (transformer model only)
These tags come from the HuggingFace transformer model. They are broader categories that don’t distinguish particle subtypes.
| Tag | Description | Notes |
|---|---|---|
| PPM | Postpositional Marker | Covers all particles (P_SUBJ, P_OBJ, P_SENT, P_MOD, P_LOC) |
| PART | General Particle | Catch-all for particles not classified as PPM |
| PUNCT | Punctuation | ။, ၊ |
| ABB | Abbreviation | Shortened forms |
| FW | Foreign Word | Non-Myanmar words |
| SB | Symbol | Special symbols |
| TN | Text Number | Numbers written in text form |
The HuggingFace model (chuuhtetnaing/myanmar-pos-model) outputs lowercase tags. The TransformerPOSTagger maps them to the internal uppercase convention via HF_TO_INTERNAL_TAG_MAP:
# HuggingFace → Internal mapping (pos_tagger_transformer.py)
HF_TO_INTERNAL_TAG_MAP = {
"n": "N", "v": "V", "adj": "ADJ",
"adv": "ADV", "pron": "PRON", "num": "NUM",
"conj": "CONJ", "int": "INT", "punc": "PUNCT",
"ppm": "PPM", "part": "PART",
"abb": "ABB", "fw": "FW", "sb": "SB", "tn": "TN",
}
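Applying the map to a raw model label might look like this (`normalize_hf_tag` is a hypothetical helper, not the library's API; unrecognized labels fall back to UNK):

```python
# Same mapping as documented in pos_tagger_transformer.py, repeated here
# so the sketch is self-contained.
HF_TO_INTERNAL_TAG_MAP = {
    "n": "N", "v": "V", "adj": "ADJ",
    "adv": "ADV", "pron": "PRON", "num": "NUM",
    "conj": "CONJ", "int": "INT", "punc": "PUNCT",
    "ppm": "PPM", "part": "PART",
    "abb": "ABB", "fw": "FW", "sb": "SB", "tn": "TN",
}

def normalize_hf_tag(label: str) -> str:
    """Map a lowercase HuggingFace label to the internal uppercase convention."""
    return HF_TO_INTERNAL_TAG_MAP.get(label.lower(), "UNK")

print(normalize_hf_tag("ppm"))      # PPM
print(normalize_hf_tag("mystery"))  # UNK
```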
Tag Disambiguation Guidelines
Many Myanmar words can have multiple POS tags depending on context. Here are common ambiguities:
1. Noun vs. Verb Ambiguity
Some words function as both nouns and verbs:
| Word | As Noun | As Verb | Resolution |
|---|---|---|---|
| စာ | book/letter (N) | to read (V) | Check for preceding ကို or following particle |
| အလုပ် | work/job (N) | to work (V) | Check sentence structure |
| ပညာ | knowledge (N) | to educate (V) | Rare as verb, default to N |
Resolution rule: If followed by ကို/အား, it’s likely a noun. If followed by တယ်/ပြီ, it’s a verb.
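The resolution rule can be sketched as a check on the following token (a toy illustration, not the library's actual disambiguator):

```python
# Particles that signal the tag of the word they follow (from the rule above).
OBJ_PARTICLES = {"ကို", "အား"}    # following particle -> preceding word is a noun
SENT_PARTICLES = {"တယ်", "ပြီ"}  # following particle -> preceding word is a verb

def resolve_n_v(next_token: str) -> str:
    """Guess N vs V for an ambiguous word from the token that follows it."""
    if next_token in OBJ_PARTICLES:
        return "N"
    if next_token in SENT_PARTICLES:
        return "V"
    return "UNK"  # no evidence either way

print(resolve_n_v("ကို"))   # N
print(resolve_n_v("တယ်"))  # V
```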
2. Adjective vs. Verb Ambiguity
Myanmar adjectives often function as stative verbs:
| Word | Context | Tag | Example |
|---|---|---|---|
| ကောင်း | standalone predicate | V | အဲဒါ ကောင်းတယ် (That is good) |
| ကောင်း | modifier before noun | ADJ | ကောင်းတဲ့ လူ (good person) |
| လှ | sentence-final | V | သူ လှတယ် (She is beautiful) |
| လှ | with သော/တဲ့ | ADJ | လှသော မိန်းကလေး (beautiful girl) |
Resolution rule: With modifier particle (သော, တဲ့) → ADJ. As predicate → V.
3. Particle Disambiguation
Particles require careful context analysis:
| Word | Possible Tags | Context | Example |
|---|---|---|---|
| က | P_SUBJ | After subject noun | သူက သွားတယ် |
| က | P_LOC | With location meaning | ရန်ကုန်က လာတယ် (came from Yangon) |
| မှာ | P_SUBJ | Topic marker | ဒါမှာ ကောင်းတယ် |
| မှာ | P_LOC | Location marker | စားပွဲမှာ ရှိတယ် (is on the table) |
| မှာ | V | “to order” meaning | ထမင်းမှာမယ် (will order rice) |
Annotation Guidelines
When annotating Myanmar text for POS tagging:
- Segment first: Ensure proper word boundaries before tagging
- Context matters: Always consider surrounding words for disambiguation
- Particle chains: Tag each particle in a chain separately
- Example: သွားပါမယ် = V(သွား) + P_SENT(ပါ) + P_SENT(မယ်)
- Compound words: Tag as single unit if dictionary entry exists
- Example: ကျောင်းသား (student) = N (not N + N)
- Numbers: Use NUM for digits and number words
- Punctuation: Exclude from POS tagging (handled separately)
Common Annotation Errors to Avoid
| Error | Incorrect | Correct | Reason |
|---|---|---|---|
| Particle as noun | သည် (N) | သည် (P_SENT) | Sentence-final particles aren’t nouns |
| Missing particle | ကောင်းတယ် (V) | ကောင်း (V) + တယ် (P_SENT) | Segment particles separately |
| Verb as adjective | ကောင်း (ADJ) in predicate | ကောင်း (V) | Predicative = verb |
| Wrong particle type | က (P) | က (P_SUBJ) | Use specific particle tags when available |
Quick Start
Default Configuration (Rule-Based)
No setup required; it works out of the box with zero dependencies:
from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
# Uses default rule-based tagger
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)
result = checker.check("မြန်မာ နိုင်ငံ")
Transformer Configuration (High Accuracy)
Install the transformers package and configure:
# Install with transformer support
pip install myspellchecker[transformers]
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig
from myspellchecker.providers import SQLiteProvider
config = SpellCheckerConfig(
pos_tagger=POSTaggerConfig(
tagger_type="transformer",
device=0, # GPU (use -1 for CPU)
)
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
Using Custom Models
Point to your fine-tuned HuggingFace model:
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig
config = SpellCheckerConfig(
pos_tagger=POSTaggerConfig(
tagger_type="transformer",
model_name="your-username/your-myanmar-pos-model",
device=-1, # CPU
)
)
Tagger Types
1. Rule-Based (Default)
Best for: Quick setup, no dependencies, production environments with tight resource constraints
Characteristics:
- Fast suffix-based morphological analysis
- Produces granular particle tags (P_SUBJ, P_OBJ, P_SENT, P_MOD, P_LOC)
- No external dependencies
- Fork-safe for multiprocessing
- Lowest memory footprint
Performance:
- Speed: Very Fast
- Accuracy: ~70%
- Memory: Very Low
- Dependencies: None
How it works:
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger
tagger = RuleBasedPOSTagger(
use_morphology_fallback=True,
cache_size=10000,
unknown_tag="UNK"
)
tag = tagger.tag_word("စားပြီ") # Returns: P_SENT
Fallback chain:
- Check pos_map (if provided)
- Morphological suffix analysis
- Return “UNK” for unknown words
2. Transformer (HuggingFace)
Best for: Maximum accuracy, when GPU is available, offline processing
Characteristics:
- Pre-trained neural models from HuggingFace
- Context-aware sequence tagging
- Produces coarse particle tags (PPM, PART) — mapped from HF lowercase tags
- Requires GPU for optimal speed
- Not fork-safe (CUDA limitations)
Performance:
- Speed: Slow (CPU), Fast (GPU)
- Accuracy: ~93%
- Memory: ~500 MB (model) + ~100 MB (buffer)
- Dependencies: transformers>=4.30.0, torch>=2.0.0
Default model: chuuhtetnaing/myanmar-pos-model (XLM-RoBERTa-based, 93.37% accuracy)
How it works:
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
tagger = TransformerPOSTagger(
model_name="chuuhtetnaing/myanmar-pos-model",
device=0, # GPU
batch_size=32,
max_length=128
)
# Single word
tag = tagger.tag_word("မြန်မာ") # Returns: N
# Sequence (context-aware)
tags = tagger.tag_sequence(["မြန်မာ", "နိုင်ငံ", "သည်"])
# Returns: ['N', 'N', 'PPM'] (coarse particle tag)
# With confidence scores
prediction = tagger.tag_word_with_confidence("ကောင်း")
print(f"{prediction.tag} (confidence: {prediction.confidence:.2f})")
# Output: ADJ (confidence: 0.95)
3. Viterbi HMM
Best for: Context-aware tagging without GPU, balanced accuracy/speed
Characteristics:
- Hidden Markov Model with Viterbi algorithm
- Uses trigram transition probabilities
- Requires pre-built probability tables
- Fork-safe
Performance:
- Speed: Fast
- Accuracy: ~85% (with probability tables), ~70% (fallback to morphology)
- Memory: ~50 MB (probability tables)
- Dependencies: None (pure Python + optional Cython)
Database Requirements:
The Viterbi tagger requires POS probability tables in the database:
| Table | Description |
|---|---|
| pos_unigrams | P(tag) - Prior tag probabilities |
| pos_bigrams | P(tag2 \| tag1) - Tag transition probabilities |
| pos_trigrams | P(tag3 \| tag1, tag2) - Trigram context |
Building database with POS probabilities:
# Build with POS tagging enabled (populates probability tables)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer
# Or use sample database (includes pre-computed probabilities)
myspellchecker build --sample
Note: If probability tables are empty, Viterbi falls back to morphological analysis with reduced accuracy (~70%).
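To verify a database before choosing the Viterbi tagger, a plain sqlite3 query against the table names above is enough (`has_pos_probabilities` is a hypothetical helper, demonstrated here against an in-memory stand-in for your dict.db):

```python
import sqlite3

def has_pos_probabilities(con: sqlite3.Connection) -> bool:
    """Return True if pos_unigrams exists and holds at least one row."""
    row = con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='pos_unigrams'"
    ).fetchone()
    if row is None:
        return False  # table missing: Viterbi would fall back to morphology
    (count,) = con.execute("SELECT COUNT(*) FROM pos_unigrams").fetchone()
    return count > 0

con = sqlite3.connect(":memory:")  # stand-in for sqlite3.connect("mydict.db")
print(has_pos_probabilities(con))  # False: table missing
con.execute("CREATE TABLE pos_unigrams (tag TEXT, prob REAL)")
con.execute("INSERT INTO pos_unigrams VALUES ('N', 0.3)")
print(has_pos_probabilities(con))  # True
```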
How it works:
from myspellchecker.algorithms.pos_tagger_factory import POSTaggerFactory
from myspellchecker.providers import SQLiteProvider
# Requires provider with POS probability tables
provider = SQLiteProvider("mydict.db")
tagger = POSTaggerFactory.create("viterbi", provider=provider, beam_width=10)
# Context-aware sequence tagging
tags = tagger.tag_sequence(["မြန်မာ", "နိုင်ငံ", "သည်"])
4. Custom Tagger
Best for: Domain-specific requirements, research experiments
Implement your own tagger by inheriting from POSTaggerBase:
from myspellchecker.algorithms.pos_tagger_base import POSTaggerBase, TaggerType
class MyCustomTagger(POSTaggerBase):
def tag_word(self, word: str) -> str:
# Your logic here
return "N"
def tag_sequence(self, words: list[str]) -> list[str]:
# Your logic here
return ["N"] * len(words)
@property
def tagger_type(self) -> TaggerType:
return TaggerType.CUSTOM
# Use via factory
from myspellchecker.algorithms.pos_tagger_factory import POSTaggerFactory
from myspellchecker.core.config import POSTaggerConfig
tagger = POSTaggerFactory.create("custom", provider=provider)
Configuration
POSTaggerConfig
Central configuration for POS tagger system:
from myspellchecker.core.config import POSTaggerConfig
config = POSTaggerConfig(
# Tagger selection
tagger_type="transformer", # "rule_based" | "transformer" | "viterbi"
# Transformer settings
model_name="chuuhtetnaing/myanmar-pos-model",
device=-1, # -1=CPU, 0+=GPU index
batch_size=32,
cache_dir=None, # Model cache directory
# Rule-based settings
cache_size=10000,
unknown_tag="UNK",
use_morphology_fallback=True,
# Viterbi settings
beam_width=10,
emission_weight=1.2,
min_prob=1e-10,
)
Environment Variables
Configure via environment variables (useful for deployment):
# Tagger type
export MYSPELL_POS_TAGGER_TYPE="transformer"
# Model selection
export MYSPELL_POS_TAGGER_MODEL_NAME="your-username/model"
# Beam width for Viterbi tagger
export MYSPELL_POS_TAGGER_BEAM_WIDTH="15"
Configuration Priority
- Explicit config in code (highest priority)
- Environment variables
- Default values (lowest priority)
Build-Time Usage
CLI - Building Dictionaries
Default (Rule-Based)
myspellchecker build \
-i corpus.txt \
-o mydict.db \
--sample=false
With Transformer
myspellchecker build \
-i corpus.txt \
-o mydict.db \
--pos-tagger transformer \
--pos-model chuuhtetnaing/myanmar-pos-model \
--pos-device 0 \
--sample=false
With Custom Model
myspellchecker build \
-i corpus.txt \
-o mydict.db \
--pos-tagger transformer \
--pos-model /path/to/my/finetuned/model \
--pos-device -1 \
--sample=false
Python API - Building Dictionaries
from myspellchecker.data_pipeline.pipeline import Pipeline
from myspellchecker.data_pipeline.config import PipelineConfig
from myspellchecker.core.config import POSTaggerConfig
# Configure pipeline with POS tagger
config = PipelineConfig(
pos_tagger=POSTaggerConfig(
tagger_type="transformer",
model_name="chuuhtetnaing/myanmar-pos-model",
device=0, # GPU
batch_size=64, # Larger batch for build-time
),
keep_intermediate=False,
)
# Build database
pipeline = Pipeline(config=config, work_dir="temp_build")
pipeline.build_database(
input_files=["corpus1.txt", "corpus2.txt"],
database_path="mydict.db",
sample=False,
)
Runtime Usage
SpellChecker Configuration
Default (Rule-Based)
from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider
# Explicit provider (required — no bundled database)
provider = SQLiteProvider(database_path="mydict.db")
checker = SpellChecker(provider=provider)
With Transformer
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig
from myspellchecker.providers import SQLiteProvider
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
# Create transformer tagger
tagger = TransformerPOSTagger(
model_name="chuuhtetnaing/myanmar-pos-model",
device=0, # GPU
)
# Create provider with tagger
provider = SQLiteProvider(database_path="mydict.db", pos_tagger=tagger)
# Create config and spell checker
config = SpellCheckerConfig(
pos_tagger=POSTaggerConfig(tagger_type="transformer", device=0)
)
checker = SpellChecker(config=config, provider=provider)
# Use spell checker
result = checker.check("မြန်မာ နိုင်ငံ ကောင်း သည်")
OOV Word Handling
The POS tagger provides fallback for out-of-vocabulary words:
from myspellchecker.providers import SQLiteProvider
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger
provider = SQLiteProvider(database_path="mydict.db")
# For known word: database lookup
pos = provider.get_word_pos("မြန်မာ") # Returns: N (from database)
# For OOV word: tagger fallback
pos = provider.get_word_pos("CompletelyUnknownWord123") # Returns: UNK
Fallback chain:
- Database lookup
- Stemming + root lookup
- POS tagger
- Morphology analyzer (backward compatibility)
- Return None or “UNK”
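A simplified sketch of that chain, with toy stand-ins for the database, stemmer, and tagger (none of these names are the provider's real internals):

```python
def lookup_pos(word, db, stem_db, tagger):
    # 1. Direct database lookup
    if word in db:
        return db[word]
    # 2. Stemming + root lookup (toy stemmer: strip a known suffix)
    for suffix in ("တယ်", "ပြီ"):
        if word.endswith(suffix) and word[: -len(suffix)] in stem_db:
            return stem_db[word[: -len(suffix)]]
    # 3. POS tagger fallback
    tag = tagger(word)
    if tag != "UNK":
        return tag
    # 4. A morphology analyzer would sit here; 5. otherwise report unknown
    return "UNK"

db = {"မြန်မာ": "N"}
stem_db = {"စား": "V"}
print(lookup_pos("မြန်မာ", db, stem_db, lambda w: "UNK"))   # N (database)
print(lookup_pos("စားတယ်", db, stem_db, lambda w: "UNK"))  # V (stem lookup)
print(lookup_pos("xyz", db, stem_db, lambda w: "UNK"))      # UNK
```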
Comparison
| Tagger | Speed | Accuracy | Memory | Context-Aware |
|---|---|---|---|---|
| Rule-Based | Very Fast | ~70% | Very Low | No |
| Viterbi | Fast | ~85% | Low | Yes |
| Transformer | Slow (CPU) / Fast (GPU) | ~93% | High | Yes |
Recommendation Matrix
| Use Case | Recommended Tagger | Reason |
|---|---|---|
| Production API | Rule-Based or Viterbi | Fast, low memory, no GPU needed |
| Batch Processing | Transformer (GPU) | Highest accuracy, GPU parallelism |
| Offline Analysis | Transformer (CPU) | Accuracy over speed |
| Embedded Systems | Rule-Based | Minimal footprint |
| Research | Transformer or Custom | Flexibility and accuracy |
Troubleshooting
Missing Dependencies
Error: ImportError: transformers required
Solution:
pip install myspellchecker[transformers]
# Or manually:
pip install transformers>=4.30.0 torch>=2.0.0
Verification:
try:
from transformers import pipeline
print("Transformers installed")
except ImportError:
print("Transformers not installed")
CUDA Errors
Error: RuntimeError: CUDA out of memory
Solutions:
- Reduce batch size:
config = POSTaggerConfig(
tagger_type="transformer",
batch_size=8, # Reduce from default 32
)
- Use CPU:
config = POSTaggerConfig(
tagger_type="transformer",
device=-1, # Force CPU
)
- Clear GPU cache:
import torch
torch.cuda.empty_cache()
Error: RuntimeError: CUDA error: device-side assert triggered
Solution: Usually model/data mismatch. Verify:
tagger = TransformerPOSTagger(device=0)
# Ensure input is valid Myanmar Unicode text
tag = tagger.tag_word("မြန်မာ") # Valid
# tag = tagger.tag_word(None) # Invalid - will crash
Model Loading Failures
Error: OSError: Can't load model from 'nonexistent/model'
Solutions:
- Verify model exists:
# Check HuggingFace model
curl -I https://huggingface.co/chuuhtetnaing/myanmar-pos-model
# Or use local path
ls /path/to/my/model/config.json
- Check internet connection (for HuggingFace downloads):
import requests
response = requests.get("https://huggingface.co")
print(f"Status: {response.status_code}")
- Use cache directory:
config = POSTaggerConfig(
tagger_type="transformer",
cache_dir="/path/to/cache", # Persistent cache
)
- Download manually:
# Download model to local directory
huggingface-cli download chuuhtetnaing/myanmar-pos-model --local-dir ./my-model
# Use local path
python -c "
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
tagger = TransformerPOSTagger(model_name='./my-model')
"
Fork-Safety Issues
Error: RuntimeError: Cannot re-initialize CUDA in forked subprocess
Cause: Transformer models use CUDA which is not fork-safe.
Solution: Use rule-based or Viterbi tagger for multiprocessing:
from multiprocessing import Pool
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger
# Fork-safe
tagger = RuleBasedPOSTagger()
def process_batch(words):
return [tagger.tag_word(w) for w in words]
with Pool(4) as pool:
results = pool.map(process_batch, batches)
# NOT fork-safe
# tagger = TransformerPOSTagger() # Will crash in forked processes
Alternative: Use spawn instead of fork:
from multiprocessing import get_context
with get_context("spawn").Pool(4) as pool:
results = pool.map(process_batch, batches)
Performance Issues
Slow tagging with transformer:
- Use GPU:
config = POSTaggerConfig(device=0) # GPU 0
- Increase batch size:
config = POSTaggerConfig(batch_size=64) # Default: 32
- Use quantization (trade accuracy for speed):
# Requires torch>=2.0
from transformers import AutoModelForTokenClassification
import torch
model = AutoModelForTokenClassification.from_pretrained(
"chuuhtetnaing/myanmar-pos-model"
)
model = torch.quantization.quantize_dynamic(
model, {torch.nn.Linear}, dtype=torch.qint8
)
Joint Segmentation and Tagging
Overview
Joint segmentation and tagging is an advanced mode that performs word segmentation and POS tagging simultaneously in a single Viterbi pass. This is different from the default sequential approach where text is first segmented, then tagged.
Default Behavior (Sequential Mode):
Input Text -> Segmenter -> Words -> POS Tagger -> Tagged Words
Joint Mode:
Input Text -> Joint Viterbi Decoder -> Words + Tags (simultaneously)
Why It’s Disabled by Default
Joint mode is disabled by default (config.joint.enabled=False) for several important reasons:
| Reason | Explanation |
|---|---|
| Increased Complexity | State space is O(positions x word_lengths x tags^2) vs O(words x tags^2) for sequential |
| Higher Memory Usage | Beam search over joint state space requires more memory |
| Less Tested | Sequential pipeline has more extensive production testing |
| Similar Accuracy | For most use cases, sequential mode achieves comparable results |
| Startup Overhead | Joint mode requires loading additional probability tables |
When to Enable Joint Mode
Joint mode may provide benefits in specific scenarios:
| Use Case | Benefit | Enable Joint? |
|---|---|---|
| Ambiguous segmentation | POS context helps resolve word boundaries | Yes |
| OOV-heavy text | Joint optimization handles unknown words better | Yes |
| Research/Experiments | Comparing segmentation approaches | Yes |
| Production API | Latency-sensitive, well-segmented text | No |
| Simple validation | Basic spell checking | No |
Configuration
Enable Joint Mode
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, JointConfig
from myspellchecker.providers import SQLiteProvider
# Enable joint segmentation-tagging
config = SpellCheckerConfig(
joint=JointConfig(
enabled=True,
beam_width=15, # Larger beam for joint state space
max_word_length=20,
emission_weight=1.2,
word_score_weight=1.0,
)
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)
Using SpellCheckerBuilder
from myspellchecker.core.builder import SpellCheckerBuilder
checker = (
SpellCheckerBuilder()
.with_joint_segmentation(enabled=True)
.build()
)
# Perform joint segmentation and tagging
words, tags = checker.segment_and_tag("မြန်မာနိုင်ငံ")
print(list(zip(words, tags)))
# Output: [('မြန်မာ', 'N'), ('နိုင်ငံ', 'N')]
JointConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | False | Enable joint segmentation-tagging mode |
| beam_width | int | 15 | Beam width for Viterbi decoding (larger = more accurate, slower) |
| max_word_length | int | 20 | Maximum word length in characters |
| emission_weight | float | 1.2 | Weight for P(tag \| word) emission probabilities |
| word_score_weight | float | 1.0 | Weight for word n-gram scores |
| min_prob | float | 1e-10 | Minimum probability threshold to prevent underflow |
| use_morphology_fallback | bool | True | Use morphology analyzer for OOV word tagging |
Performance Comparison
| Mode | Speed | Memory | Best For |
|---|---|---|---|
| Sequential | Fast | Low | Production, latency-sensitive |
| Joint | Moderate | Higher | Ambiguous text, research |
Note: Performance varies based on text complexity and hardware.
Usage Example
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, JointConfig
from myspellchecker.providers import SQLiteProvider
# Sequential mode (default)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
sequential_checker = SpellChecker(provider=provider)
words_seq, tags_seq = sequential_checker.segment_and_tag("မြန်မာနိုင်ငံသည်ကောင်းသည်")
# Joint mode
joint_config = SpellCheckerConfig(
joint=JointConfig(enabled=True)
)
joint_checker = SpellChecker(config=joint_config, provider=provider)
words_joint, tags_joint = joint_checker.segment_and_tag("မြန်မာနိုင်ငံသည်ကောင်းသည်")
# Compare results
print(f"Sequential: {list(zip(words_seq, tags_seq))}")
print(f"Joint: {list(zip(words_joint, tags_joint))}")
Technical Details
The joint decoder uses a unified Viterbi algorithm that optimizes:
argmax P(words, tags | text)
= argmax prod_i P(word_i | word_{i-1}) x P(tag_i | tag_{i-1}, tag_{i-2}) x P(tag_i | word_i)
State representation: (position, word_start, current_tag, prev_tag)
Scoring components:
- Word score: log P(word | prev_word) - N-gram language model
- Transition score: log P(tag | prev_tags) - POS tag sequence model
- Emission score: log P(tag | word) - Word-to-tag emission probability
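In log space the three weighted components combine additively, mirroring the JointConfig weights (`joint_score` is an illustrative sketch, not the decoder's actual code):

```python
import math

def joint_score(p_word, p_trans, p_emit,
                emission_weight=1.2, word_score_weight=1.0, min_prob=1e-10):
    """Weighted log-space score for one (word, tag) step of the joint decoder."""
    clamp = lambda p: max(p, min_prob)  # min_prob guards against log(0) underflow
    return (word_score_weight * math.log(clamp(p_word))   # word score
            + math.log(clamp(p_trans))                    # transition score
            + emission_weight * math.log(clamp(p_emit)))  # emission score

s = joint_score(p_word=0.01, p_trans=0.2, p_emit=0.5)
print(round(s, 3))  # -7.046
```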
Limitations
- Requires probability tables: Joint mode needs bigram/trigram probabilities in the database
- Not all segmenters support it: Only
JointSegmentTagger implements joint mode
- Base segmenters raise NotImplementedError: Individual segmenters don’t support joint mode; use
SpellChecker.segment_and_tag() instead
Advanced Topics
Fine-Tuning Custom Models
Train your own Myanmar POS tagger on domain-specific data:
# 1. Prepare training data (word, POS tag pairs)
training_data = [
("မြန်မာ", "N"),
("နိုင်ငံ", "N"),
("ကောင်း", "ADJ"),
# ... more examples
]
# 2. Use HuggingFace Trainer (example)
from transformers import (
AutoModelForTokenClassification,
AutoTokenizer,
TrainingArguments,
Trainer
)
model = AutoModelForTokenClassification.from_pretrained(
"xlm-roberta-base",
num_labels=len(pos_tags)
)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# 3. Train (simplified)
trainer = Trainer(
model=model,
args=TrainingArguments(output_dir="./my-myanmar-pos"),
train_dataset=train_dataset,
)
trainer.train()
# 4. Use your model
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
tagger = TransformerPOSTagger(model_name="./my-myanmar-pos")
Extending with Custom Taggers
Create domain-specific taggers:
from myspellchecker.algorithms.pos_tagger_base import POSTaggerBase, POSPrediction, TaggerType
class DomainSpecificTagger(POSTaggerBase):
"""Medical domain POS tagger."""
def __init__(self, medical_terms_dict):
self.medical_terms = medical_terms_dict
def tag_word(self, word: str) -> str:
# Check medical terminology first
if word in self.medical_terms:
return self.medical_terms[word]
# Fallback to heuristics
if word.endswith("ရောဂါ"):
return "N_DISEASE"
return "UNK"
def tag_sequence(self, words: list[str]) -> list[str]:
return [self.tag_word(w) for w in words]
@property
def tagger_type(self) -> TaggerType:
return TaggerType.CUSTOM
# Usage
medical_dict = {
"ငှက်ဖျားရောဂါ": "N_DISEASE",
"ဆေးဝါး": "N_MEDICINE",
}
tagger = DomainSpecificTagger(medical_dict)
Combining Multiple Taggers
Ensemble approach for higher accuracy:
class EnsembleTagger(POSTaggerBase):
def __init__(self, taggers: list[POSTaggerBase], weights: list[float]):
self.taggers = taggers
self.weights = weights
def tag_word_with_confidence(self, word: str) -> POSPrediction:
predictions = [
t.tag_word_with_confidence(word) for t in self.taggers
]
# Weighted voting
votes = {}
for pred, weight in zip(predictions, self.weights):
votes[pred.tag] = votes.get(pred.tag, 0) + weight * pred.confidence
best_tag = max(votes, key=votes.get)
confidence = votes[best_tag] / sum(self.weights)
return POSPrediction(word=word, tag=best_tag, confidence=confidence)
# ... implement other methods
# Usage
from myspellchecker.algorithms.pos_tagger_rule import RuleBasedPOSTagger
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
ensemble = EnsembleTagger(
taggers=[
RuleBasedPOSTagger(),
TransformerPOSTagger(),
],
weights=[0.3, 0.7] # Trust transformer more
)
Caching Strategies
Optimize performance with intelligent caching:
from functools import lru_cache
from myspellchecker.algorithms.pos_tagger_transformer import TransformerPOSTagger
class CachedTransformerTagger(TransformerPOSTagger):
def __init__(self, *args, cache_size=10000, **kwargs):
super().__init__(*args, **kwargs)
self._setup_cache(cache_size)
def _setup_cache(self, cache_size):
self._tag_word_cached = lru_cache(maxsize=cache_size)(
super().tag_word
)
def tag_word(self, word: str) -> str:
return self._tag_word_cached(word)
# Usage - 10x speedup for repeated words
tagger = CachedTransformerTagger(cache_size=50000)
Acknowledgments
The default transformer-based POS tagger uses the myanmar-pos-model by Chuu Htet Naing:
| Attribute | Value |
|---|---|
| Model | chuuhtetnaing/myanmar-pos-model |
| Author | Chuu Htet Naing |
| Base Model | XLM-RoBERTa |
| Accuracy | 93.37% |
| F1 Score | 92.24% |
| License | Please refer to the model’s Hugging Face page for license information |
This model was trained specifically for Myanmar/Burmese Part-of-Speech tagging and provides state-of-the-art accuracy for the language.
Citation: If you use the transformer POS tagger in your research, please cite the original model:
@misc{chuuhtetnaing-myanmar-pos,
author = {Chuu Htet Naing},
title = {Myanmar POS Model},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/chuuhtetnaing/myanmar-pos-model}
}
We express our gratitude to Chuu Htet Naing for making this model publicly available, which significantly enhances the accuracy of Myanmar language processing in mySpellChecker.
See Also