Context Checking - mySpellChecker

Context checking is the third layer of validation that detects “real-word errors” — words that are spelled correctly but used incorrectly in context.

Understanding Real-Word Errors

The Problem

Consider this Myanmar sentence:

ထမင်းသွား — unnatural ("rice" + "go" — not a meaningful combination)

Both “ထမင်း” (rice) and “သွား” (go) are valid words individually. But together, they don’t form a natural phrase. The user likely meant:

ထမင်းစား — "eat rice" / "have a meal" (common, natural phrase)

Standard spell checkers miss these errors because each word is spelled correctly.

The Solution: N-gram Analysis

mySpellChecker uses N-gram probabilities to detect statistically unlikely word combinations:

P("ထမင်း သွား") = 0.0001  # Very unlikely
P("ထမင်း စား") = 0.0850  # Common combination

How It Works

Bigram Model

Analyzes pairs of adjacent words:

text = "ထမင်း သွား ပြီ"
bigrams = [
    ("ထမင်း", "သွား"),  # Check probability
    ("သွား", "ပြီ"),     # Check probability
]

Trigram Model

Analyzes triplets for more context:

text = "သူ ထမင်း သွား ပြီ"
trigrams = [
    ("သူ", "ထမင်း", "သွား"),
    ("ထမင်း", "သွား", "ပြီ"),
]
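
The bigram and trigram lists above can be produced with one sliding-window helper. This is a minimal, library-independent sketch; `extract_ngrams` is a hypothetical name and is not part of mySpellChecker.

```python
from typing import List, Tuple


def extract_ngrams(words: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n adjacent words, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # When the text has fewer than n words the range is empty, so the result is [].
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = ["သူ", "ထမင်း", "သွား", "ပြီ"]
bigrams = extract_ngrams(words, 2)   # 3 pairs for a 4-word text
trigrams = extract_ngrams(words, 3)  # 2 triplets for a 4-word text
```

A text of k words yields k - 1 bigrams and k - 2 trigrams, which is why very short inputs give the checker little to work with.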

Probability Threshold

Word pairs whose N-gram probability falls below a threshold are flagged:

bigram_threshold = 0.0001  # Default bigram threshold

if P(bigram) < bigram_threshold:
    flag_as_context_error()
    suggest_alternatives()
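
The pseudocode above can be turned into a small runnable sketch. The probability table, the `flag_unlikely_bigrams` helper, and the choice to treat unseen pairs as probability 0.0 are all assumptions made for illustration; they do not describe the library's internals.

```python
from typing import Dict, List, Tuple

RICE, EAT, GO, DONE = "ထမင်း", "စား", "သွား", "ပြီ"

# Toy probabilities for illustration only; they are not from a real corpus.
TOY_BIGRAM_PROBS: Dict[Tuple[str, str], float] = {
    (RICE, EAT): 0.085,
    (RICE, GO): 0.00001,
    (GO, DONE): 0.02,
}


def flag_unlikely_bigrams(
    words: List[str],
    probs: Dict[Tuple[str, str], float],
    threshold: float = 0.0001,
) -> List[int]:
    """Return indexes of words whose bigram with the previous word is below threshold."""
    flagged = []
    for i in range(1, len(words)):
        # Unseen pairs default to 0.0 here; a real checker would apply smoothing.
        probability = probs.get((words[i - 1], words[i]), 0.0)
        if probability < threshold:
            flagged.append(i)
    return flagged


print(flag_unlikely_bigrams([RICE, GO, DONE], TOY_BIGRAM_PROBS))  # [1]
```

The comparison is strict, so a pair whose probability equals the threshold exactly is not flagged.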

Configuration

Enable Context Checking

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider
from myspellchecker.core.constants import ValidationLevel

# Enable context validation via config
config = SpellCheckerConfig(
    use_context_checker=True,  # Enable N-gram context checking
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

# Use word-level validation for thorough checking.
# Context checking is enabled via the use_context_checker config, not the validation level.
text = "ထမင်း သွား ပြီ"
result = checker.check(text, level=ValidationLevel.WORD)

Context Settings

from myspellchecker.core.config import SpellCheckerConfig, NgramContextConfig

config = SpellCheckerConfig(
    use_context_checker=True,

    # N-gram context configuration
    ngram_context=NgramContextConfig(
        bigram_threshold=0.0001,  # Bigram-specific threshold
        trigram_threshold=0.0001,  # Trigram-specific threshold

        # Scoring weights (must sum to ~1.0)
        edit_distance_weight=0.6,  # Weight for edit distance in scoring
        probability_weight=0.4,    # Weight for probability in scoring

        # Smoothing configuration
        use_smoothing=True,                    # Enable smoothing (default)
        smoothing_strategy="stupid_backoff",   # Options: 'none', 'stupid_backoff', 'add_k'
        backoff_weight=0.4,                    # Weight for Stupid Backoff
        add_k_smoothing=0.0,                   # Add-k constant (for 'add_k' strategy)

        # Unigram backoff
        unigram_denominator=1000000.0,  # Approximate corpus word count
        unigram_prob_cap=0.1,           # Max unigram probability cap
        min_unigram_threshold=5,        # Min freq for valid-in-unseen-context
    ),
)
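
The two scoring weights suggest that candidate suggestions are ranked by a blend of spelling similarity and contextual probability. The sketch below shows one plausible reading, a linear combination. The formula, the `combined_score` name, and the `max_edit_distance` normalization are assumptions and are not confirmed by the library documentation.

```python
def combined_score(
    edit_distance: int,
    probability: float,
    max_edit_distance: int = 2,          # hypothetical normalization bound
    edit_distance_weight: float = 0.6,
    probability_weight: float = 0.4,
) -> float:
    """Blend spelling similarity and N-gram probability into one score (higher is better)."""
    # Map edit distance to a similarity in [0, 1]: 0 edits -> 1.0, max edits -> 0.0.
    clipped = min(edit_distance, max_edit_distance)
    similarity = 1.0 - clipped / max_edit_distance
    return edit_distance_weight * similarity + probability_weight * probability


# A close candidate with modest probability versus a distant one with high probability.
near = combined_score(edit_distance=1, probability=0.1)
far = combined_score(edit_distance=2, probability=0.9)
```

Under this reading, keeping the weights summing to about 1.0 keeps the score in the range 0 to 1, which makes scores comparable across candidates.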

NgramContextChecker Configuration

from myspellchecker.algorithms.ngram_context_checker import (
    NgramContextChecker,
    SmoothingStrategy,
)

context_checker = NgramContextChecker(
    provider=provider,
    threshold=0.01,                # Minimum probability threshold

    # Smoothing configuration
    smoothing_strategy=SmoothingStrategy.STUPID_BACKOFF,  # NONE, STUPID_BACKOFF, ADD_K
    backoff_weight=0.4,            # Weight for Stupid Backoff
    add_k_smoothing=0.0,           # Add-k smoothing constant (if using ADD_K)

    # Advanced options
    edit_distance_weight=0.3,      # Weight for edit distance in scoring
    probability_weight=0.7,        # Weight for probability in scoring
)

Smoothing Strategies

The library supports multiple smoothing strategies for handling unseen N-grams:

Strategy	Description	Use Case
`NONE`	No smoothing, raw probabilities	Pre-smoothed data
`STUPID_BACKOFF`	Simple backoff with configurable weight (default)	General use, fast
`ADD_K`	Add-k (additive) smoothing; equivalent to Laplace smoothing when k = 1	Small vocabularies

from myspellchecker.algorithms.ngram_context_checker import (
    NgramContextChecker,
    SmoothingStrategy,
)

# Use Stupid Backoff (default, recommended)
checker = NgramContextChecker(
    provider=provider,
    smoothing_strategy=SmoothingStrategy.STUPID_BACKOFF,
    backoff_weight=0.4,  # P(unseen) = 0.4 * P(lower-order)
)

# Use Add-K smoothing
checker = NgramContextChecker(
    provider=provider,
    smoothing_strategy=SmoothingStrategy.ADD_K,
    add_k_smoothing=0.01,  # Add small constant to all counts
)

# Disable smoothing (for pre-smoothed data)
checker = NgramContextChecker(
    provider=provider,
    smoothing_strategy=SmoothingStrategy.NONE,
)

Skipped Context Words

The context validator skips 28 high-frequency Myanmar particles that appear in virtually all contexts and have low discriminative power. These particles are never flagged as context errors:

Subject/object markers: က, ကို, သည်, တယ်
Locative particles: မှာ, မှ, တွင်
Comitative/conjunctive: နဲ့, နှင့်, နှင်
Genitive/possessive: ရဲ့, ၏
Emphasis/interjection: ကွာ, ဗျာ, နော်, ဟေ့, ကွ, လေ, ပါ, ပဲ, ပေါ့
Other common particles: များ, လည်း, တော့, ပြီး, ဖို့, အတွက်
Question particles: လား, လဲ

The full set is defined in core/constants/myanmar_constants.py:SKIPPED_CONTEXT_WORDS.
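
The effect of the skip list can be sketched as a filter applied to flagged positions. The set below is a small hand-picked subset of the particles listed above, and `drop_skipped` is a hypothetical helper; the library's own set lives in `SKIPPED_CONTEXT_WORDS` and its filtering logic may differ.

```python
from typing import FrozenSet, List

# Illustrative subset only; see SKIPPED_CONTEXT_WORDS for the full set.
SKIPPED_SUBSET: FrozenSet[str] = frozenset({"က", "ကို", "မှာ", "လား"})


def drop_skipped(
    flagged_positions: List[int],
    words: List[str],
    skipped: FrozenSet[str] = SKIPPED_SUBSET,
) -> List[int]:
    """Keep only flagged positions whose word is not a skipped particle."""
    return [pos for pos in flagged_positions if words[pos] not in skipped]


words = ["သူ", "က", "ထမင်း", "သွား"]
remaining = drop_skipped([1, 3], words)  # position 1 is the particle "က"
```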

Context Error Types

Word Substitution

Correct word, wrong context:

# "သွား" (go) instead of "စား" (eat)
result = checker.check("ထမင်းသွား")
# Error: ContextError suggesting "ထမင်းစား" (eat rice)

N-gram Probability Calculation

Bigram Probability

P(w2 | w1) = count(w1, w2) / count(w1)

# Example
P("စား" | "ထမင်း") = count("ထမင်း စား") / count("ထမင်း")
                    = 8500 / 10000
                    = 0.85
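
The same ratio can be computed from raw counts. The sketch below is independent of the library and uses a four-sentence toy corpus, so the resulting numbers are illustrative only; `build_bigram_model` is a hypothetical name.

```python
from collections import Counter
from typing import Callable, Iterable, List

RICE, EAT, GO = "ထမင်း", "စား", "သွား"


def build_bigram_model(sentences: Iterable[List[str]]) -> Callable[[str, str], float]:
    """Count unigrams and adjacent pairs, then return P(w2 | w1) as a function."""
    unigram_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for words in sentences:
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))

    def probability(w1: str, w2: str) -> float:
        # count(w1, w2) / count(w1); an unseen w1 yields 0.0 instead of dividing by zero.
        if unigram_counts[w1] == 0:
            return 0.0
        return bigram_counts[(w1, w2)] / unigram_counts[w1]

    return probability


toy_corpus = [[RICE, EAT], [RICE, EAT], [RICE, EAT], [RICE, GO]]
p = build_bigram_model(toy_corpus)
print(p(RICE, EAT))  # 0.75
print(p(RICE, GO))   # 0.25
```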

Trigram Probability

P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)

# For unseen trigrams, the configured smoothing strategy applies (Stupid Backoff by default)

Smoothing

For unseen N-grams, smoothing prevents zero probabilities:

# Stupid Backoff (default) - Fast and effective
# For unseen bigrams: P_backoff = alpha * P(unigram)
# For unseen trigrams: P_backoff = alpha * P(bigram)
P_backoff(w2 | w1) = backoff_weight * P(w2)  # if (w1, w2) unseen

# Add-K (additive) smoothing - Simple but can oversmooth; k = 1 gives Laplace smoothing, V = vocabulary size
P_smooth(w2 | w1) = (count(w1, w2) + k) / (count(w1) + k * V)

# No smoothing - Use raw probabilities (for pre-smoothed data)
P(w2 | w1) = count(w1, w2) / count(w1)

Configure smoothing via SmoothingStrategy enum:

from myspellchecker.algorithms.ngram_context_checker import SmoothingStrategy

# Options: NONE, STUPID_BACKOFF (default), ADD_K
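
The two smoothing formulas above can be checked against toy counts. This is a standalone sketch under stated assumptions: the function names and the counts are made up for illustration, and the library's own implementation (for example its unigram denominator and probability cap) is not reproduced here.

```python
from collections import Counter

# Toy counts, not from a real corpus.
unigrams = Counter({"a": 10, "b": 6, "c": 4})
bigrams = Counter({("a", "b"): 5})
total_words = sum(unigrams.values())  # 20
vocab_size = len(unigrams)            # 3


def stupid_backoff(w1: str, w2: str, backoff_weight: float = 0.4) -> float:
    """Use the bigram ratio when the pair was seen, else back off to the unigram."""
    if bigrams[(w1, w2)] > 0 and unigrams[w1] > 0:
        return bigrams[(w1, w2)] / unigrams[w1]
    return backoff_weight * unigrams[w2] / total_words


def add_k(w1: str, w2: str, k: float = 0.01) -> float:
    """Add k to every bigram count so no pair ends up with probability zero."""
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size)


seen = stupid_backoff("a", "b")    # 5 / 10
unseen = stupid_backoff("a", "c")  # 0.4 * (4 / 20)
smoothed = add_k("a", "c")         # 0.01 / (10 + 0.03)
```

Stupid Backoff returns relative scores rather than normalized probabilities, so values for a given `w1` need not sum to 1. That is usually acceptable when the scores are only compared against a threshold or against each other.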

Performance Characteristics

Metric	Value
Speed	Moderate
Bigram Lookup	O(1)
Trigram Lookup	O(1)
Context Analysis	O(n) where n = word count

Context checking is slower than syllable/word validation because it performs N-gram lookups for each word pair. Memory usage and latency depend on corpus size, database backend, and caching configuration.

API Reference

Using SpellChecker for Context Validation

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

# Enable context checking
config = SpellCheckerConfig(use_context_checker=True)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

# Check text - context validation is applied automatically
result = checker.check("ထမင်း သွား ပြီ")

# Context errors have additional information
for error in result.errors:
    if hasattr(error, 'probability'):
        print(f"Context error: {error.text}")
        print(f"Bigram probability: {error.probability}")
        print(f"Previous word: {error.prev_word}")
        print(f"Suggestions: {error.suggestions}")

Note: Direct instantiation of ContextValidator requires a DI container setup with registered validation strategies. For most use cases, use SpellChecker.check() with use_context_checker=True in the config.

NgramContextChecker

from myspellchecker.algorithms.ngram_context_checker import NgramContextChecker

checker = NgramContextChecker(provider=provider)

# Get smoothed bigram probability
prob = checker.get_smoothed_bigram_probability("ထမင်း", "စား")

# Get smoothed trigram probability
prob = checker.get_smoothed_trigram_probability("သူ", "ထမင်း", "စား")

# Check if a word is a contextual error
is_error = checker.is_contextual_error(
    prev_word="ထမင်း",
    current_word="သွား",
    next_word="ပြီ",
)

# Get context-aware suggestions sorted by score
suggestions = checker.suggest(
    prev_word="ထမင်း",
    current_word="သွား",
    next_word="ပြီ",
)
for s in suggestions:
    print(f"{s.term} (P={s.probability:.3f}, score={s.score:.3f})")

# Analyze an entire sequence
words = ["ထမင်း", "သွား", "ပြီ"]
analysis = checker.analyze_sequence(words)
for pos, prob, is_err in analysis:
    if is_err:
        print(f"Word at {pos} is contextually unlikely (P={prob:.3f})")

Common Patterns

Context-Aware Autocomplete

def get_next_word_suggestions(text: str, max_suggestions: int = 5) -> list:
    """Get likely next words based on context."""
    # `checker` is the SpellChecker instance created earlier.
    words = checker.segmenter.segment_words(text)

    if not words:
        return []

    # Get continuations based on last word
    last_word = words[-1]
    return checker.provider.get_top_continuations(
        last_word,
        limit=max_suggestions
    )

Detect Uncommon Phrases

def find_unusual_phrases(text: str, provider, segmenter, threshold: float = 0.0001) -> list:
    """Find statistically unusual word combinations."""
    words = segmenter.segment_words(text)
    unusual = []

    for i in range(len(words) - 1):
        bigram = (words[i], words[i + 1])
        prob = provider.get_bigram_probability(*bigram)

        if prob < threshold:
            unusual.append({
                "bigram": bigram,
                "probability": prob,
                "position": i,
            })

    return unusual

Context-Only Validation

from myspellchecker.algorithms.ngram_context_checker import NgramContextChecker

def check_context_only(text: str, provider, segmenter) -> list:
    """Check only context errors, skip syllable/word validation."""
    context_checker = NgramContextChecker(provider=provider)

    # Segment text into words
    words = segmenter.segment_words(text)
    context_errors = []

    # Analyze the full sequence for contextual errors
    analysis = context_checker.analyze_sequence(words)
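    for pos, prob, is_error in analysis:
        # The first word has no previous word; skipping it prevents words[-1]
        # from silently wrapping around to the last word.
        if is_error and pos > 0:
            # Get suggestions for the flagged word
            prev_word = words[pos - 1]
            current_word = words[pos]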
            next_word = words[pos + 1] if pos + 1 < len(words) else None
            suggestions = context_checker.suggest(
                prev_word, current_word, next_word=next_word
            )
            context_errors.append({
                "word": current_word,
                "position": pos,
                "probability": prob,
                "suggestions": [s.term for s in suggestions],
            })

    return context_errors

Limitations

Rare Valid Phrases

Unusual but correct phrases may be flagged:

"ရွှေရောင်နှင်းဆီ" — "golden rose" (poetic, rare in everyday corpus)
"ဒေတာဘေ့စ်ဆာဗာ" — "database server" (tech jargon, unlikely in general corpus)

Solution: Adjust threshold or use domain-specific corpus.

Corpus Bias

N-gram probabilities reflect corpus biases:

# News-heavy corpus:
"သမ္မတ ပြောကြားသည်" — "the president said" (favored, common in news)
"အမေ ပြောတယ်" — "mom said" (may be flagged, rare in news but common in speech)

Solution: Use balanced corpus or multiple domain corpora.

Short Context

Limited context may reduce accuracy:

"မြန်မာ" — single word, no context to analyze
"မြန်မာ နိုင်ငံ" — two words, only one bigram available
"မြန်မာ နိုင်ငံ သမိုင်း" — three words, bigram + trigram context available

Solution: Process longer text segments when possible.

Troubleshooting

Issue: Too many false positives

Cause: The threshold is too high or the corpus is too narrow.
Solution: Lower the bigram threshold so that fewer word pairs fall below it:

from myspellchecker.core.config import SpellCheckerConfig, NgramContextConfig

config = SpellCheckerConfig(
    ngram_context=NgramContextConfig(
        bigram_threshold=0.00001,  # Lower threshold
    ),
)

Issue: Missing context errors

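Cause: The threshold is too low or the N-grams are missing from the database.
Solution: Raise the bigram threshold so that more low-probability pairs are flagged: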

from myspellchecker.core.config import SpellCheckerConfig, NgramContextConfig

config = SpellCheckerConfig(
    ngram_context=NgramContextConfig(
        bigram_threshold=0.001,  # Higher threshold
    ),
)

Issue: Slow context checking

Cause: Large N-gram database or many lookups.
Solution: Enable caching via AlgorithmCacheConfig to speed up repeated lookups:

from myspellchecker.core.config import SpellCheckerConfig, AlgorithmCacheConfig

config = SpellCheckerConfig(
    cache=AlgorithmCacheConfig(
        bigram_cache_size=32768,   # Increase bigram cache
        trigram_cache_size=32768,  # Increase trigram cache
    ),
)
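
To see why a larger cache helps, the sketch below wraps a stand-in lookup with `functools.lru_cache` and counts how often the backend is actually hit. It illustrates the general technique only; `CountingLookup` and `make_cached_lookup` are hypothetical names, and this is not how `AlgorithmCacheConfig` is implemented.

```python
from functools import lru_cache
from typing import Callable, Dict, Tuple


class CountingLookup:
    """Stand-in for a slow database lookup that records how often it is called."""

    def __init__(self, table: Dict[Tuple[str, str], float]) -> None:
        self.table = table
        self.calls = 0

    def get(self, w1: str, w2: str) -> float:
        self.calls += 1
        return self.table.get((w1, w2), 0.0)


def make_cached_lookup(backend: CountingLookup, maxsize: int = 32768) -> Callable[[str, str], float]:
    """Return a memoized version of backend.get that holds up to maxsize entries."""

    @lru_cache(maxsize=maxsize)
    def cached(w1: str, w2: str) -> float:
        return backend.get(w1, w2)

    return cached


backend = CountingLookup({("ထမင်း", "စား"): 0.085})
lookup = make_cached_lookup(backend)
first = lookup("ထမင်း", "စား")   # reaches the backend
second = lookup("ထမင်း", "စား")  # served from the cache
```

Frequent word pairs recur often in running text, so even a modest cache can absorb a large share of lookups. The benefit shrinks when the input is dominated by rare pairs.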

Next Steps

Grammar Checking - Rule-based syntactic validation
Semantic Checking - AI-powered context analysis
N-gram Algorithm - Deep dive into N-gram models
