Viterbi Algorithm

This page covers the implementation details, database requirements, beam search tuning, and Cython acceleration for the Viterbi decoder used in mySpellChecker’s POS tagging and joint segmentation pipeline.

Overview

The Viterbi algorithm finds the most likely sequence of hidden states (POS tags) given observed emissions (words). It uses dynamic programming for O(n) efficiency.

Requirements

The Viterbi POS tagger requires probability tables in the database to function effectively:

Table	Purpose	Required?
`pos_unigrams`	P(tag) - Tag frequency distribution	Recommended
`pos_bigrams`	P(tag₂ \| tag₁) - Tag transition probabilities	Required
`pos_trigrams`	P(tag₃ \| tag₁, tag₂) - Trigram transitions	Recommended

Building Database with POS Probabilities

POS probability tables are populated when you build a database with a POS tagger:

# Build with Viterbi tagger (uses rule-based for initial tagging)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger viterbi

# Build with transformer tagger (higher accuracy, requires GPU)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer

# Build sample database (includes pre-computed POS probabilities)
myspellchecker build --sample

Checking Database for POS Tables

import sqlite3

conn = sqlite3.connect("dict.db")
cursor = conn.cursor()

# Check POS table counts
for table in ["pos_unigrams", "pos_bigrams", "pos_trigrams"]:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"{table}: {count} entries")

conn.close()

Fallback Behavior

If POS probability tables are empty or missing:

Viterbi falls back to morphological analysis for tag prediction
Accuracy drops from ~85% to ~70% (similar to rule-based tagger)
A warning is logged: “Provider does not support bigram probabilities”

How It Works

Hidden Markov Model

States: POS tags (NOUN, VERB, PARTICLE, ...)
Observations: Words/syllables
Transitions: P(tag_i | tag_{i-1})
Emissions: P(word | tag)

Algorithm Steps

Initialization: Set probabilities for first position
Recursion: For each position, compute best path to each state
Termination: Find best final state
Backtracking: Reconstruct optimal path

# Pseudocode
for t in range(1, T):
    for s in states:
        viterbi[t][s] = max(
            viterbi[t-1][s'] * transition[s'][s] * emission[s][word[t]]
            for s' in states
        )

Implementation

Python Wrapper

from myspellchecker.algorithms.viterbi import ViterbiTagger

tagger = ViterbiTagger(
    provider=provider,
    pos_bigram_probs=bigram_probs,      # P(t2|t1) transition probabilities
    pos_trigram_probs=trigram_probs,     # P(t3|t1,t2) trigram transitions
    pos_unigram_probs=unigram_probs,    # P(t) unigram probabilities (optional)
    beam_width=10,                       # Beam search pruning (default: 10)
)

tags = tagger.tag_sequence(["ကျွန်တော်", "သွားပါမယ်"])
# Output: ["PRON", "VERB"]

Cython Acceleration

The library includes a Cython implementation for 10x speedup:

# Automatically uses Cython if available
from myspellchecker.algorithms.viterbi import ViterbiTagger

# Check if Cython is loaded
from myspellchecker.algorithms.viterbi import _HAS_CYTHON_VITERBI
print(_HAS_CYTHON_VITERBI)  # True if Cython extension loaded

Configuration

from myspellchecker.core.config import POSTaggerConfig

config = POSTaggerConfig(
    tagger_type="viterbi",      # Use Viterbi tagger
    beam_width=10,              # Beam search width (default: 10)
)

Note: Unknown word handling is controlled via the min_prob (default: 1e-10) and unknown_word_tag (default: "UNK") parameters on the ViterbiTagger constructor, not via a separate penalty parameter.

Beam Search

Beam search prunes unlikely paths for efficiency:

# Instead of tracking all states, keep top-k
beam_width = 10  # Default

# At each step, only keep top beam_width paths
# Reduces complexity from O(|S|^T) to O(k * |S| * T)

Beam Width	Speed	Accuracy
5	Fast	~83%
10	Balanced (default)	~85%
20+	Slow	~86%

Joint Segmentation + Tagging

The Viterbi algorithm can jointly segment and tag text:

from myspellchecker.algorithms.joint_segment_tagger import JointSegmentTagger

tagger = JointSegmentTagger(
    provider=provider,
    beam_width=15,
    word_score_weight=1.0,
)

segments, tags = tagger.segment_and_tag("ကျွန်တော်သွားပါမယ်")
# segments: ["ကျွန်တော်", "သွားပါမယ်"]
# tags: ["PRON", "VERB"]

See Joint Segment Tagger for details.

Emission Score Logic

For the first token (t=1), emission scores are handled specially:

# At t=1, emission score = P(word | tag)
# No transition from previous state

# For t>1, score = transition * emission
score = transition[prev_tag][tag] * emission[tag][word]

Performance

Operation	Cython	Python	Speedup
Tag 10 words	0.5ms	5ms	10x
Tag 100 words	3ms	30ms	10x
Tag 1000 words	25ms	250ms	10x

Integration

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig

config = SpellCheckerConfig(
    pos_tagger=POSTaggerConfig(
        tagger_type="viterbi",
        beam_width=10,
    )
)

checker = SpellChecker(config=config)

API Reference

Core Internals

Spelling Algorithms

Context & Grammar Algorithms

Segmentation & Tagging

Architecture

CLI Reference

Error & Rules Reference

Data Reference

Data Pipeline Internals

Overview

Requirements

Building Database with POS Probabilities

Checking Database for POS Tables

Fallback Behavior

How It Works

Hidden Markov Model

Algorithm Steps

Implementation

Python Wrapper

Cython Acceleration

Configuration

Beam Search

Joint Segmentation + Tagging

Emission Score Logic

Performance

Integration

See Also

API Reference

Core Internals

Spelling Algorithms

Context & Grammar Algorithms

Segmentation & Tagging

Architecture

CLI Reference

Error & Rules Reference

Data Reference

Data Pipeline Internals

​Overview

​Requirements

​Building Database with POS Probabilities

​Checking Database for POS Tables

​Fallback Behavior

​How It Works

​Hidden Markov Model

​Algorithm Steps

​Implementation

​Python Wrapper

​Cython Acceleration

​Configuration

​Beam Search

​Joint Segmentation + Tagging

​Emission Score Logic

​Performance

​Integration

​See Also

Overview

Requirements

Building Database with POS Probabilities

Checking Database for POS Tables

Fallback Behavior

How It Works

Hidden Markov Model

Algorithm Steps

Implementation

Python Wrapper

Cython Acceleration

Configuration

Beam Search

Joint Segmentation + Tagging

Emission Score Logic

Performance

Integration

See Also