Skip to main content
This page covers the implementation details, database requirements, beam search tuning, and Cython acceleration for the Viterbi decoder used in mySpellChecker’s POS tagging and joint segmentation pipeline.

Overview

The Viterbi algorithm finds the most likely sequence of hidden states (POS tags) given observed emissions (words). It uses dynamic programming for O(n) efficiency.

Requirements

The Viterbi POS tagger requires probability tables in the database to function effectively:
TablePurposeRequired?
pos_unigramsP(tag) - Tag frequency distributionRecommended
pos_bigramsP(tag₂ | tag₁) - Tag transition probabilitiesRequired
pos_trigramsP(tag₃ | tag₁, tag₂) - Trigram transitionsRecommended

Building Database with POS Probabilities

POS probability tables are populated when you build a database with a POS tagger:
# Build with Viterbi tagger (uses rule-based for initial tagging)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger viterbi

# Build with transformer tagger (higher accuracy, requires GPU)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer

# Build sample database (includes pre-computed POS probabilities)
myspellchecker build --sample

Checking Database for POS Tables

import sqlite3

conn = sqlite3.connect("dict.db")
cursor = conn.cursor()

# Check POS table counts
for table in ["pos_unigrams", "pos_bigrams", "pos_trigrams"]:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"{table}: {count} entries")

conn.close()

Fallback Behavior

If POS probability tables are empty or missing:
  1. Viterbi falls back to morphological analysis for tag prediction
  2. Accuracy drops from ~85% to ~70% (similar to rule-based tagger)
  3. A warning is logged: “Provider does not support bigram probabilities”

How It Works

Hidden Markov Model

States: POS tags (NOUN, VERB, PARTICLE, ...)
Observations: Words/syllables
Transitions: P(tag_i | tag_{i-1})
Emissions: P(word | tag)

Algorithm Steps

  1. Initialization: Set probabilities for first position
  2. Recursion: For each position, compute best path to each state
  3. Termination: Find best final state
  4. Backtracking: Reconstruct optimal path
# Pseudocode
for t in range(1, T):
    for s in states:
        viterbi[t][s] = max(
            viterbi[t-1][s'] * transition[s'][s] * emission[s][word[t]]
            for s' in states
        )

Implementation

Python Wrapper

from myspellchecker.algorithms.viterbi import ViterbiTagger

tagger = ViterbiTagger(
    provider=provider,
    pos_bigram_probs=bigram_probs,      # P(t2|t1) transition probabilities
    pos_trigram_probs=trigram_probs,     # P(t3|t1,t2) trigram transitions
    pos_unigram_probs=unigram_probs,    # P(t) unigram probabilities (optional)
    beam_width=10,                       # Beam search pruning (default: 10)
)

tags = tagger.tag_sequence(["ကျွန်တော်", "သွားပါမယ်"])
# Output: ["PRON", "VERB"]

Cython Acceleration

The library includes a Cython implementation for 10x speedup:
# Automatically uses Cython if available
from myspellchecker.algorithms.viterbi import ViterbiTagger

# Check if Cython is loaded
from myspellchecker.algorithms.viterbi import _HAS_CYTHON_VITERBI
print(_HAS_CYTHON_VITERBI)  # True if Cython extension loaded

Configuration

from myspellchecker.core.config import POSTaggerConfig

config = POSTaggerConfig(
    tagger_type="viterbi",      # Use Viterbi tagger
    beam_width=10,              # Beam search width (default: 10)
)
Note: Unknown word handling is controlled via the min_prob (default: 1e-10) and unknown_word_tag (default: "UNK") parameters on the ViterbiTagger constructor, not via a separate penalty parameter.
Beam search prunes unlikely paths for efficiency:
# Instead of tracking all states, keep top-k
beam_width = 10  # Default

# At each step, only keep top beam_width paths
# Reduces complexity from O(|S|^T) to O(k * |S| * T)
Beam WidthSpeedAccuracy
5Fast~83%
10Balanced (default)~85%
20+Slow~86%

Joint Segmentation + Tagging

The Viterbi algorithm can jointly segment and tag text:
from myspellchecker.algorithms.joint_segment_tagger import JointSegmentTagger

tagger = JointSegmentTagger(
    provider=provider,
    beam_width=15,
    word_score_weight=1.0,
)

segments, tags = tagger.segment_and_tag("ကျွန်တော်သွားပါမယ်")
# segments: ["ကျွန်တော်", "သွားပါမယ်"]
# tags: ["PRON", "VERB"]
See Joint Segment Tagger for details.

Emission Score Logic

For the first token (t=1), emission scores are handled specially:
# At t=1, emission score = P(word | tag)
# No transition from previous state

# For t>1, score = transition * emission
score = transition[prev_tag][tag] * emission[tag][word]

Performance

OperationCythonPythonSpeedup
Tag 10 words0.5ms5ms10x
Tag 100 words3ms30ms10x
Tag 1000 words25ms250ms10x

Integration

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, POSTaggerConfig

config = SpellCheckerConfig(
    pos_tagger=POSTaggerConfig(
        tagger_type="viterbi",
        beam_width=10,
    )
)

checker = SpellChecker(config=config)

See Also