Overview
N-gram models capture the likelihood of word sequences, enabling detection of “real-word errors”: correctly spelled words used incorrectly in context.

How It Works
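The probabilities described in this section reduce to simple corpus counts. Here is a self-contained sketch of maximum-likelihood bigram and trigram estimates (illustrative only, not the library's implementation):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bigram_prob(tokens, w1, w2):
    """P(w2 | w1) = count(w1, w2) / count(w1)."""
    bigrams = ngram_counts(tokens, 2)
    unigrams = ngram_counts(tokens, 1)
    if unigrams[(w1,)] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[(w1,)]

def trigram_prob(tokens, w1, w2, w3):
    """P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)."""
    trigrams = ngram_counts(tokens, 3)
    bigrams = ngram_counts(tokens, 2)
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

corpus = "the cat sat on the mat the cat ran".split()
print(bigram_prob(corpus, "the", "cat"))  # 2/3: two of the three "the" tokens are followed by "cat"
```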
Bigram Model (2-gram)
Calculates the probability of word pairs, P(w2 | w1) = count(w1, w2) / count(w1).

Trigram Model (3-gram)
Extends to word triplets for richer context, P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2).

Smoothing Strategies
The library supports multiple smoothing strategies for handling unseen N-grams:

Stupid Backoff (Default)
Fast and effective for most use cases: when an N-gram is unseen, the score backs off to the next shorter N-gram (trigram to bigram to unigram), scaled by a constant backoff factor.

Add-K (Laplace) Smoothing
Adds a constant k to all counts, so unseen N-grams receive a small non-zero probability.

No Smoothing
Returns raw probabilities (for pre-smoothed data).

Configuration
NgramContextChecker and NgramContextConfig have different defaults for some parameters.

When using NgramContextConfig (e.g., via SpellCheckerConfig), the config defaults apply:
- threshold=0.01 (from AlgorithmDefaults.NGRAM_THRESHOLD)
- trigram_threshold=0.0001
- edit_distance_weight=0.6
- probability_weight=0.4

When constructing NgramContextChecker directly, the class defaults apply:
- threshold=0.01
- trigram_threshold=0.005
- edit_distance_weight=0.3
- probability_weight=0.7

The config-based path is recommended for consistency.
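The two default sets can be captured in a small sketch. The dataclasses below are illustrative stand-ins, not the library's real classes; only the parameter names and values come from the documentation above:

```python
from dataclasses import dataclass

@dataclass
class NgramContextConfig:
    """Stand-in for the config-path defaults (e.g., via SpellCheckerConfig)."""
    threshold: float = 0.01            # AlgorithmDefaults.NGRAM_THRESHOLD
    trigram_threshold: float = 0.0001
    edit_distance_weight: float = 0.6
    probability_weight: float = 0.4

@dataclass
class NgramContextCheckerDefaults:
    """Stand-in for the class defaults used when constructing directly."""
    threshold: float = 0.01
    trigram_threshold: float = 0.005
    edit_distance_weight: float = 0.3
    probability_weight: float = 0.7

# The two paths disagree most sharply on trigram_threshold: 0.0001 vs 0.005.
print(NgramContextConfig().trigram_threshold,
      NgramContextCheckerDefaults().trigram_threshold)
```

Pinning the defaults down like this makes the divergence explicit, which is why the config-based path is recommended for consistency.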
Error Detection
The checker uses a two-path detection strategy based on raw trigram availability:

- Trigram path: when a raw trigram probability exists in the corpus (P_raw(w3|w1,w2) > 0), the checker uses the trigram-specific threshold (trigram_threshold, default 0.005) to decide whether the word is an error. This avoids false positives from smoothed backoff values.
- Bigram fallback path: when no raw trigram is found, the checker falls back to bigram probabilities with bidirectional context checking, unigram backoff for common words, and typo neighbor detection via SymSpell.
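The two-path decision can be sketched as a single function. The helper below is hypothetical and simplified; only the thresholds (taken from the direct-construction defaults under Configuration) come from the documentation:

```python
def is_context_error(raw_trigram_prob, bigram_prob,
                     trigram_threshold=0.005, threshold=0.01):
    """Sketch of the two-path detection strategy.

    raw_trigram_prob: unsmoothed P(w3 | w1, w2) from the corpus,
    or 0.0 if the trigram was never observed.
    bigram_prob: the fallback bigram probability.
    """
    if raw_trigram_prob > 0:
        # Trigram path: compare the raw probability against the
        # trigram-specific threshold, so smoothed backoff values
        # never trigger a false positive here.
        return raw_trigram_prob < trigram_threshold
    # Bigram fallback path (simplified: the real checker also applies
    # bidirectional context, unigram backoff, and SymSpell typo neighbors).
    return bigram_prob < threshold

print(is_context_error(0.002, 0.5))  # True: trigram observed, but below 0.005
```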
Suggestion Generation
Suggestions are generated by:

- Finding words with higher conditional probability
- Filtering by edit distance (max 2)
- Ranking by combined probability and distance score
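The three steps above can be sketched as follows. The exact scoring formula is an assumption (a weighted sum of conditional probability and inverted edit distance, using the 0.7/0.3 direct-construction weights); only the max edit distance of 2 comes from the documentation:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, candidates, probability_weight=0.7,
            edit_distance_weight=0.3, max_distance=2):
    """Filter candidates by edit distance <= max_distance, then rank by a
    weighted mix of conditional probability and closeness (assumed formula)."""
    scored = []
    for cand, prob in candidates.items():
        d = edit_distance(word, cand)
        if 0 < d <= max_distance:
            closeness = 1 - d / (max_distance + 1)
            scored.append((probability_weight * prob
                           + edit_distance_weight * closeness, cand))
    return [cand for _, cand in sorted(scored, reverse=True)]

# candidates maps each word to a hypothetical P(candidate | context)
print(suggest("there", {"their": 0.4, "three": 0.1, "theirs": 0.05}))
# ['their', 'three', 'theirs']
```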
Performance
| Operation | Complexity | Typical Time |
|---|---|---|
| Bigram lookup | O(1) | <1 ms |
| Trigram lookup | O(1) | <1 ms |
| Context analysis | O(n) | ~100 ms for average text |
| Suggestion generation | O(k) | ~50 ms |
Database Schema
N-gram data is stored in SQLite using INTEGER foreign keys referencing the words table (not TEXT columns):
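A minimal schema illustrating this layout (the n-gram table and column names are hypothetical; only the words table and the INTEGER foreign keys are from the documentation above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    PRAGMA foreign_keys = ON;

    CREATE TABLE words (
        id   INTEGER PRIMARY KEY,
        word TEXT NOT NULL UNIQUE
    );

    -- N-gram rows reference words by INTEGER id rather than storing
    -- TEXT, keeping rows compact and lookups fast.
    CREATE TABLE bigrams (
        word1_id INTEGER NOT NULL REFERENCES words(id),
        word2_id INTEGER NOT NULL REFERENCES words(id),
        count    INTEGER NOT NULL,
        PRIMARY KEY (word1_id, word2_id)
    );

    CREATE TABLE trigrams (
        word1_id INTEGER NOT NULL REFERENCES words(id),
        word2_id INTEGER NOT NULL REFERENCES words(id),
        word3_id INTEGER NOT NULL REFERENCES words(id),
        count    INTEGER NOT NULL,
        PRIMARY KEY (word1_id, word2_id, word3_id)
    );
""")

# Words get integer ids (rowids) on insert; n-grams store only the ids.
conn.executemany("INSERT INTO words (word) VALUES (?)", [("the",), ("cat",)])
conn.execute("INSERT INTO bigrams VALUES (1, 2, 5)")
row = conn.execute("""
    SELECT w1.word, w2.word, b.count
    FROM bigrams b
    JOIN words w1 ON w1.id = b.word1_id
    JOIN words w2 ON w2.id = b.word2_id
""").fetchone()
print(row)  # ('the', 'cat', 5)
```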
See Also
- Context Checking - Feature documentation
- SymSpell Algorithm - Word-level suggestions
- Semantic Checking - AI-powered context analysis
- Performance Tuning - Optimization