
Overview

N-gram models capture the likelihood of word sequences, enabling detection of “real-word errors” — correctly spelled words used incorrectly in context.

How It Works

Bigram Model (2-gram)

Calculates the probability of word pairs:
P(w2 | w1) = count(w1, w2) / count(w1)

# Example
P("စား" | "ထမင်း") = count("ထမင်း စား") / count("ထမင်း")
                    = 8500 / 10000
                    = 0.85
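The bigram formula above can be sketched as a maximum-likelihood estimate over a token list. This is a minimal illustration, not the library's API; the function name and the tiny corpus are assumptions for the example.

```python
# Sketch: P(w2 | w1) = count(w1, w2) / count(w1), estimated from raw tokens.
from collections import Counter

def bigram_probability(tokens, w1, w2):
    """Estimate P(w2 | w1) from a tokenized corpus."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

corpus = ["ထမင်း", "စား", "ပြီ", "ထမင်း", "စား", "ထမင်း", "ချက်"]
# 2 of the 3 occurrences of "ထမင်း" are followed by "စား" → 2/3 ≈ 0.667
print(bigram_probability(corpus, "ထမင်း", "စား"))
```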

Trigram Model (3-gram)

Extends to word triplets for richer context:
P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
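The trigram estimate extends the same counting idea: divide the trigram count by the count of its two-word prefix. A minimal sketch (assumed names, not the library's internals):

```python
# Sketch: P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2).
from collections import Counter

def trigram_probability(tokens, w1, w2, w3):
    """Estimate P(w3 | w1, w2) from a tokenized corpus."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

corpus = ["ထမင်း", "စား", "ပြီ", "ထမင်း", "စား", "မယ်"]
# "ထမင်း စား" occurs twice; once followed by "ပြီ" → 1/2 = 0.5
print(trigram_probability(corpus, "ထမင်း", "စား", "ပြီ"))
```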

Smoothing Strategies

The library supports multiple smoothing strategies for handling unseen N-grams:

Stupid Backoff (Default)

Fast and effective for most use cases:
P_backoff(w2 | w1) =
    P(w2 | w1)              if count(w1, w2) > 0
    α * P(w2)               otherwise

# Where α is the backoff weight (default: 0.4)
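The backoff rule above can be expressed directly from the counts. This sketch assumes precomputed counts are passed in; it is not the library's implementation.

```python
# Sketch of Stupid Backoff: use the raw bigram estimate when the pair was
# seen, otherwise back off to alpha * P(w2).
def stupid_backoff(bigram_count, unigram_count_w1, unigram_prob_w2, alpha=0.4):
    """Return P(w2 | w1) if count(w1, w2) > 0, else alpha * P(w2)."""
    if bigram_count > 0 and unigram_count_w1 > 0:
        return bigram_count / unigram_count_w1
    return alpha * unigram_prob_w2

print(stupid_backoff(8500, 10000, 0.02))  # seen bigram → 8500/10000 = 0.85
print(stupid_backoff(0, 10000, 0.02))     # unseen → alpha * P(w2) ≈ 0.008
```

Note that the backed-off value is a score, not a true probability; the scores across all w2 no longer sum to 1, which is why Stupid Backoff is fast but only suitable for ranking.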

Add-K (Laplace) Smoothing

Adds constant k to all counts:
P_smooth(w2 | w1) = (count(w1, w2) + k) / (count(w1) + k * V)

# Where V is vocabulary size, k is smoothing constant
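A direct translation of the Add-K formula, as a hedged sketch (function name and values are illustrative, not from the library):

```python
# Sketch: P_smooth(w2 | w1) = (count(w1, w2) + k) / (count(w1) + k * V).
def add_k_probability(bigram_count, unigram_count, vocab_size, k=1.0):
    """Add-k (Laplace when k=1) smoothed bigram probability."""
    return (bigram_count + k) / (unigram_count + k * vocab_size)

# An unseen pair still gets a small non-zero probability:
print(add_k_probability(0, 10000, 50000, k=1.0))     # 1/60000 ≈ 1.67e-05
# A seen pair is dampened toward the uniform distribution:
print(add_k_probability(8500, 10000, 50000, k=1.0))  # 8501/60000 ≈ 0.1417
```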

No Smoothing

Returns raw probabilities (for pre-smoothed data):
P(w2 | w1) = count(w1, w2) / count(w1)
# Returns 0 for unseen pairs

Configuration

from myspellchecker.algorithms.ngram_context_checker import (
    NgramContextChecker,
    SmoothingStrategy,
)

checker = NgramContextChecker(
    provider=provider,
    threshold=0.01,                # Bigram error threshold (flag if P < threshold)
    trigram_threshold=0.005,       # Trigram error threshold (used when raw trigram exists)

    # Smoothing
    smoothing_strategy=SmoothingStrategy.STUPID_BACKOFF,
    backoff_weight=0.4,
    add_k_smoothing=0.0,           # For ADD_K strategy (0.0 = disabled by default)
)

Note on defaults: NgramContextChecker and NgramContextConfig have different defaults for some parameters.
  1. NgramContextConfig (e.g., via SpellCheckerConfig): threshold=0.01 (from AlgorithmDefaults.NGRAM_THRESHOLD), trigram_threshold=0.0001, edit_distance_weight=0.6, probability_weight=0.4.
  2. NgramContextChecker constructed directly: threshold=0.01, trigram_threshold=0.005, edit_distance_weight=0.3, probability_weight=0.7.
The config-based path is recommended for consistency.

Error Detection

The checker uses a two-path detection strategy based on raw trigram availability:
  1. Trigram path: When a raw trigram probability exists in the corpus (P_raw(w3|w1,w2) > 0), the checker uses the trigram-specific threshold (trigram_threshold, default 0.005) to determine if the word is an error. This avoids false positives from smoothed backoff values.
  2. Bigram fallback path: When no raw trigram is found, the checker falls back to bigram probabilities with bidirectional context checking, unigram backoff for common words, and typo neighbor detection via SymSpell.
This design ensures that smoothed backoff values (which are always > 0) do not gate the checker into the trigram path, keeping the bigram heuristics reachable.
# Input: "ထမင်းသွား" (Rice go)
# P("သွား" | "ထမင်း") = 0.0001  # Very low
# P("စား" | "ထမင်း") = 0.85    # High

# → Flag "သွား", suggest "စား"
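The two-path gating logic can be condensed into a small predicate. The function and parameter names here are assumptions for illustration, not the library's internals; the thresholds match the class defaults listed above.

```python
# Sketch of the two-path detection decision: gate on the trigram threshold
# only when a RAW trigram probability exists, otherwise fall back to bigrams.
def is_context_error(raw_trigram_prob, bigram_prob,
                     threshold=0.01, trigram_threshold=0.005):
    """Return True if the word should be flagged as a context error."""
    if raw_trigram_prob > 0:                 # trigram path: raw count observed
        return raw_trigram_prob < trigram_threshold
    return bigram_prob < threshold           # bigram fallback path

print(is_context_error(0.0, 0.0001))  # no raw trigram, low bigram → True
print(is_context_error(0.2, 0.0001))  # confident raw trigram → False
```

Because the check uses the raw trigram probability rather than the smoothed one, an always-positive backoff score cannot force every word onto the trigram path.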

Suggestion Generation

Suggestions are generated by:
  1. Finding words with higher conditional probability
  2. Filtering by edit distance (max 2)
  3. Ranking by combined probability and distance score
suggestions = checker.suggest(
    prev_word="ထမင်း",
    current_word="သွား",
    max_edit_distance=2,
    next_word="ပြီ",
)
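The combined ranking in step 3 could be weighted roughly as follows. This is a hypothetical scoring sketch using the probability_weight/edit_distance_weight parameters mentioned in the configuration note; the exact formula the library uses may differ.

```python
# Hypothetical combined score: weight conditional probability against a
# normalized edit-distance score (closer candidates score higher).
def combined_score(probability, edit_distance, max_distance=2,
                   probability_weight=0.7, edit_distance_weight=0.3):
    distance_score = 1.0 - edit_distance / (max_distance + 1)
    return probability_weight * probability + edit_distance_weight * distance_score

# Candidates as (word, P(word | prev), edit distance from the typed word):
candidates = [("စား", 0.85, 1), ("သွားသည်", 0.02, 1)]
ranked = sorted(candidates, key=lambda c: -combined_score(c[1], c[2]))
print(ranked[0][0])  # the high-probability candidate "စား" ranks first
```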

Performance

| Operation | Complexity | Typical Time |
| --- | --- | --- |
| Bigram lookup | O(1) | <1 ms |
| Trigram lookup | O(1) | <1 ms |
| Context analysis | O(n) | ~100 ms for average-length text |
| Suggestion generation | O(k) | ~50 ms |

Database Schema

N-gram data is stored in SQLite using INTEGER foreign keys referencing the words table (not TEXT columns):
-- Bigrams (word IDs reference words.id)
CREATE TABLE bigrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER,
    word2_id INTEGER,
    probability REAL DEFAULT 0.0,
    count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id)
);

-- Trigrams (word IDs reference words.id)
CREATE TABLE trigrams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word1_id INTEGER,
    word2_id INTEGER,
    word3_id INTEGER,
    probability REAL DEFAULT 0.0,
    count INTEGER DEFAULT 0,
    FOREIGN KEY(word1_id) REFERENCES words(id),
    FOREIGN KEY(word2_id) REFERENCES words(id),
    FOREIGN KEY(word3_id) REFERENCES words(id),
    UNIQUE(word1_id, word2_id, word3_id)
);
To query bigrams for a specific word, use a JOIN:
SELECT w1.word, w2.word, b.probability
FROM bigrams b
JOIN words w1 ON b.word1_id = w1.id
JOIN words w2 ON b.word2_id = w2.id
WHERE w1.word = 'ထမင်း'
ORDER BY b.probability DESC;
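The same JOIN can be run from Python with the standard-library sqlite3 module. The function name and parameterization here are a sketch; the query matches the schema shown above.

```python
# Sketch: query the top bigrams for a word via sqlite3, using a
# parameterized version of the JOIN shown above.
import sqlite3

def top_bigrams(conn, word, limit=10):
    """Return (word1, word2, probability) rows for bigrams starting at word."""
    return conn.execute(
        """
        SELECT w1.word, w2.word, b.probability
        FROM bigrams b
        JOIN words w1 ON b.word1_id = w1.id
        JOIN words w2 ON b.word2_id = w2.id
        WHERE w1.word = ?
        ORDER BY b.probability DESC
        LIMIT ?
        """,
        (word, limit),
    ).fetchall()
```

Using `?` placeholders instead of string formatting keeps the query safe against injection and lets SQLite cache the prepared statement.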

See Also