Understanding Real-Word Errors
The Problem
Consider this Myanmar sentence:
The Solution: N-gram Analysis
mySpellChecker uses N-gram probabilities to detect statistically unlikely word combinations.
How It Works
Bigram Model
Analyzes pairs of adjacent words.
Trigram Model
Analyzes triplets of adjacent words for more context.
Probability Threshold
Words whose N-gram probability falls below a threshold are flagged.
Configuration
Enable Context Checking
Context Settings
NgramContextChecker Configuration
Smoothing Strategies
The library supports multiple smoothing strategies for handling unseen N-grams:
| Strategy | Description | Use Case |
|---|---|---|
| NONE | No smoothing, raw probabilities | Pre-smoothed data |
| STUPID_BACKOFF | Simple backoff with configurable weight (default) | General use, fast |
| ADD_K | Add-k smoothing (Laplace smoothing when k = 1) | Small vocabularies |
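To make the table more concrete, here is a minimal, self-contained sketch of how the two non-trivial strategies are conventionally defined for bigrams. It is not mySpellChecker's implementation: the function names, the toy counts, and the backoff weight of 0.4 (the value commonly cited for Stupid Backoff) are assumptions for illustration, and the library's own default weight is not stated here.

```python
from collections import Counter

# Toy counts (hypothetical data, not taken from any real corpus).
unigrams = Counter({"a": 4, "b": 3, "c": 1})
bigrams = Counter({("a", "b"): 3, ("b", "c"): 1})
TOTAL = sum(unigrams.values())  # 8 tokens in the toy corpus


def stupid_backoff(prev: str, word: str, weight: float = 0.4) -> float:
    """Score a bigram; fall back to a discounted unigram score if unseen."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    # Unseen bigram: back off to the unigram relative frequency.
    return weight * unigrams[word] / TOTAL


def add_k(prev: str, word: str, k: float = 1.0) -> float:
    """Add-k estimate: every possible bigram receives k pseudo-counts."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)
```

Note that Stupid Backoff returns scores rather than normalized probabilities, which is part of why it is fast, while add-k keeps a proper distribution at the cost of shifting probability mass toward unseen pairs.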
Skipped Context Words
The context validator skips 28 high-frequency Myanmar particles that appear in virtually all contexts and have low discriminative power. These particles are never flagged as context errors:
- Subject/object markers: က, ကို, သည်, တယ်
- Locative particles: မှာ, မှ, တွင်
- Comitative/conjunctive: နဲ့, နှင့်, နှင်
- Genitive/possessive: ရဲ့, ၏
- Emphasis/interjection: ကွာ, ဗျာ, နော်, ဟေ့, ကွ, လေ, ပါ, ပဲ, ပေါ့
- Other common particles: များ, လည်း, တော့, ပြီး, ဖို့, အတွက်
- Question particles: လား, လဲ
The full set is defined in core/constants/myanmar_constants.py:SKIPPED_CONTEXT_WORDS.
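As a rough illustration of what skipping means in practice, the sketch below filters a tokenized sentence down to the positions that remain eligible for context flagging. The `SKIPPED` set is a small subset copied from the list above, and the function name `positions_to_check` is hypothetical; how the library itself applies `SKIPPED_CONTEXT_WORDS` may differ.

```python
# A small subset of the particles listed above, for illustration only.
SKIPPED = frozenset({"က", "ကို", "သည်", "မှာ", "လား"})


def positions_to_check(words: list[str]) -> list[int]:
    """Return indices of words that are eligible for context flagging."""
    return [i for i, w in enumerate(words) if w not in SKIPPED]


# Tokens at indices 1, 3, and 5 are particles, so only 0, 2, and 4 remain.
sentence = ["သူ", "က", "စာ", "ကို", "ဖတ်", "သည်"]
print(positions_to_check(sentence))  # [0, 2, 4]
```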
Context Error Types
Word Substitution
Correct word, wrong context:
N-gram Probability Calculation
Bigram Probability
Trigram Probability
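The two headings above name conditional probabilities that are conventionally estimated by maximum likelihood from corpus counts. The formulas below are the textbook definitions, offered as an assumption about what is computed before any smoothing is applied; $C(\cdot)$ denotes the number of times a word sequence occurs in the corpus.

$$
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1},\, w_i)}{C(w_{i-1})}
\qquad
P(w_i \mid w_{i-2},\, w_{i-1}) = \frac{C(w_{i-2},\, w_{i-1},\, w_i)}{C(w_{i-2},\, w_{i-1})}
$$

For instance, if a word occurs 50 times in the corpus and is followed by a particular word in 5 of those occurrences, the bigram estimate is $5 / 50 = 0.1$.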
Smoothing
For unseen N-grams, smoothing prevents zero probabilities.
SmoothingStrategy enum:
Performance Characteristics
| Metric | Value |
|---|---|
| Speed | Moderate |
| Bigram Lookup | O(1) |
| Trigram Lookup | O(1) |
| Context Analysis | O(n) where n = word count |
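The complexity figures in the table follow naturally if N-gram counts are stored in hash maps: each lookup is a constant-time dictionary access, and a sentence of n words needs a single pass over adjacent pairs. The sketch below illustrates this with a toy bigram table. The data, the function names, and the 0.01 threshold are hypothetical and do not reflect mySpellChecker's internals, which may use a different storage backend.

```python
from collections import Counter

# Hypothetical toy counts standing in for a trained model.
unigram_counts = Counter({"I": 10, "eat": 6, "rice": 5, "read": 4})
bigram_counts = Counter({("I", "eat"): 5, ("eat", "rice"): 4, ("I", "read"): 3})


def bigram_prob(prev: str, word: str) -> float:
    """O(1): two dictionary lookups and one division."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]


def flag_unlikely(words: list[str], threshold: float = 0.01) -> list[int]:
    """O(n): one pass over adjacent pairs, flagging low-probability words."""
    return [
        i
        for i in range(1, len(words))
        if bigram_prob(words[i - 1], words[i]) < threshold
    ]


print(flag_unlikely(["I", "eat", "rice"]))   # []
print(flag_unlikely(["I", "read", "rice"]))  # [2]
```

In the second sentence every word is in the vocabulary, but the pair ("read", "rice") never occurs in the toy counts, so the word at index 2 is flagged. This is the real-word error case that dictionary lookup alone cannot catch.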
API Reference
Using SpellChecker for Context Validation
ContextValidator requires a dependency injection (DI) container with registered validation strategies. For most use cases, call SpellChecker.check() with use_context_checker=True in the config instead.
NgramContextChecker
Common Patterns
Context-Aware Autocomplete
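One way to picture this pattern, independent of mySpellChecker's API, is to rank the words that have been observed after the previous word and filter them by what the user has typed so far. The sketch below does this with a toy bigram table; the counts and the function name `suggest_next` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical bigram counts; a real model would be built from a corpus.
bigram_counts = Counter({
    ("eat", "rice"): 4,
    ("eat", "noodles"): 2,
    ("eat", "bread"): 1,
    ("read", "books"): 3,
})

# Index continuations by their preceding word for fast lookup.
continuations: dict[str, Counter] = defaultdict(Counter)
for (prev, word), count in bigram_counts.items():
    continuations[prev][word] = count


def suggest_next(prev: str, prefix: str = "", limit: int = 3) -> list[str]:
    """Rank words seen after `prev` that start with `prefix`, most frequent first."""
    candidates = continuations.get(prev, Counter())
    ranked = [w for w, _ in candidates.most_common() if w.startswith(prefix)]
    return ranked[:limit]


print(suggest_next("eat"))        # ['rice', 'noodles', 'bread']
print(suggest_next("eat", "n"))   # ['noodles']
```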
Detect Uncommon Phrases
Context-Only Validation
Limitations
Rare Valid Phrases
Unusual but correct phrases may be flagged.
Corpus Bias
N-gram probabilities reflect corpus biases.
Short Context
Limited context may reduce accuracy.
Troubleshooting
Issue: Too many false positives
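Cause: Threshold too high or corpus too narrow
Solution: Lower the probability threshold, or build the N-gram model from a broader corpus.
Issue: Missing context errors
Cause: Threshold too low or missing N-grams
Solution: Raise the probability threshold, or extend the corpus so that the missing N-grams are covered.
Issue: Slow context checking
Cause: Large N-gram database or many lookups
Solution: Enable caching via AlgorithmCacheConfig to speed up repeated lookups.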
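The effect of caching repeated lookups can be illustrated with Python's standard `functools.lru_cache`. This is a generic stand-in for the idea, not `AlgorithmCacheConfig` itself, whose parameters are not shown here; `slow_bigram_lookup` is a hypothetical placeholder for a database query.

```python
from functools import lru_cache

lookup_calls = 0  # counts how often the slow backend is actually hit


def slow_bigram_lookup(prev: str, word: str) -> float:
    """Stand-in for a database query (hypothetical; returns a fixed value)."""
    global lookup_calls
    lookup_calls += 1
    return 0.25


@lru_cache(maxsize=4096)
def cached_bigram_prob(prev: str, word: str) -> float:
    """Memoized wrapper: repeated (prev, word) pairs skip the backend."""
    return slow_bigram_lookup(prev, word)


for _ in range(1000):
    cached_bigram_prob("a", "b")

print(lookup_calls)  # 1
```

Because common word pairs recur heavily in running text, even a modest cache size tends to absorb most lookups.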
Next Steps
- Grammar Checking - Rule-based syntactic validation
- Semantic Checking - AI-powered context analysis
- N-gram Algorithm - Deep dive into N-gram models