Overview
mySpellChecker uses a multi-layered approach with specialized algorithms at each level:

| Layer | Algorithm | Purpose | Complexity |
|---|---|---|---|
| Syllable | Rule-based FSM | Structure validation | O(n) |
| Word | SymSpell | Fast suggestions | O(1) |
| Grammar | POS + Rules | Syntactic validation | O(n) |
| Context | N-gram LM | Probability checking | O(1) |
| Semantic | ONNX Neural | Deep understanding | O(n) |
Contents
Core Algorithms
- SymSpell - O(1) spelling correction algorithm
- Edit Distance - Levenshtein, Damerau-Levenshtein, weighted algorithms
- N-gram Model - Statistical language modeling
- Viterbi - POS tagging with HMM
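The core trick behind SymSpell's O(1) lookups is precomputing delete-variants of every dictionary word, so candidate retrieval is pure hash lookups. A minimal sketch of the idea (function names are illustrative, not the library's API; the real algorithm additionally filters candidates by true Damerau-Levenshtein distance):

```python
from itertools import combinations

def deletes(word: str, max_dist: int = 2) -> set[str]:
    """All strings obtainable by deleting up to max_dist characters."""
    results = {word}
    for d in range(1, max_dist + 1):
        for idx in combinations(range(len(word)), d):
            results.add("".join(c for i, c in enumerate(word) if i not in idx))
    return results

def build_index(dictionary: list[str], max_dist: int = 2) -> dict[str, set[str]]:
    """Map every delete-variant back to the dictionary words producing it."""
    index: dict[str, set[str]] = {}
    for word in dictionary:
        for variant in deletes(word, max_dist):
            index.setdefault(variant, set()).add(word)
    return index

def lookup(term: str, index: dict[str, set[str]], max_dist: int = 2) -> set[str]:
    """Candidates: any dictionary word sharing a delete-variant with the term."""
    candidates: set[str] = set()
    for variant in deletes(term, max_dist):
        candidates |= index.get(variant, set())
    return candidates
```

The index is large (hence the ~100MB figure below) but makes each lookup independent of dictionary size.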
Suggestion System
- Suggestion Strategy - Pluggable suggestion generation interface
- Suggestion Ranking - Multi-factor ranking algorithms
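Multi-factor ranking typically blends edit distance with corpus frequency so that common words win ties. A minimal sketch, with illustrative weights and names (the real ranking algorithm may use more factors):

```python
import math

def rank(candidates: dict[str, int], frequencies: dict[str, int]) -> list[str]:
    """candidates maps word -> edit distance; frequencies maps word -> corpus count.
    Lower edit distance wins; log-frequency acts as a soft tiebreaker."""
    def score(word: str) -> float:
        dist = candidates[word]
        freq = frequencies.get(word, 1)
        return dist - 0.1 * math.log(freq)  # 0.1 is an illustrative weight
    return sorted(candidates, key=score)    # ascending: best suggestion first
```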
Context & Grammar
- Context-Aware Validation - Context and grammar validation overview
- Grammar Rules - Syntactic grammar rule engine
- Semantic Analysis - Neural semantic validation
POS and Tagging
- POS Disambiguator - Context-based POS disambiguation rules
- Joint Segment Tagger - Unified segmentation and POS tagging
- Tone Disambiguation - Context-aware tone mark inference
Text Processing
- Normalization - Text preprocessing and Unicode handling
- Segmentation - Word segmentation algorithms
- Syllable Segmentation - Rule-based syllable breaking
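The Normalization stage (Unicode NFC plus zero-width removal, per the pipeline table below) can be sketched in a few lines; the exact set of zero-width characters stripped by the library is an assumption here:

```python
import unicodedata

# Assumed zero-width set: ZWSP, ZWNJ, ZWJ, BOM; the real list may differ.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize(text: str) -> str:
    """Compose combining sequences to NFC, then drop zero-width characters."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)
```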
Entity & Pattern Recognition
- Named Entity Recognition - Heuristic NER for proper nouns
- Phonetic Matching - Sound-based similarity matching
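Sound-based matching works by mapping words to phonetic keys and comparing keys instead of spellings. The library's scheme is likely language-specific; classic English Soundex is shown here purely to illustrate the idea:

```python
def soundex(word: str) -> str:
    """Classic Soundex: first letter + three digits coding consonant groups."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":   # h/w are transparent: they don't reset the previous code
            prev = code
    return (first + "".join(digits) + "000")[:4]
```

Words that sound alike collapse to the same key, so a key index gives cheap phonetic candidate retrieval.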
Algorithm Selection
When to Use Each Algorithm
The validation pipeline processes text through these stages in order:

| Stage | Algorithm | When It Runs | What It Does |
|---|---|---|---|
| Normalization | Unicode NFC + zero-width removal | Always | Cleans input text |
| Segmentation | Rule-based syllable breaking | Always | Splits text into syllables |
| Layer 1: Syllable | FSM + dictionary lookup | Always | Validates syllable structure |
| Suggestions | SymSpell | On invalid syllables | Generates correction suggestions |
| Layer 2: Word | SymSpell | ValidationLevel.WORD | Validates words, suggests fixes |
| Layer 2.5: Grammar | Viterbi POS + rule engine | ValidationLevel.WORD | Checks syntactic correctness |
| Layer 3: Context | N-gram probabilities | ValidationLevel.WORD | Checks word-in-context fitness |
| Semantic | ONNX semantic model | When enabled | Deep context understanding |
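The gating in the table can be sketched as a dispatch function. `ValidationLevel.WORD` is referenced in the table; the `SYLLABLE` member and the stage names are assumptions for illustration:

```python
from enum import Enum, auto

class ValidationLevel(Enum):
    # WORD appears in the pipeline table; SYLLABLE is assumed here.
    SYLLABLE = auto()
    WORD = auto()

def stages_to_run(level: ValidationLevel, semantic_enabled: bool = False) -> list[str]:
    """Which pipeline stages fire for a given level, per the table above."""
    stages = ["normalization", "segmentation", "syllable"]  # always run
    if level is ValidationLevel.WORD:
        stages += ["word", "grammar", "context"]            # gated on WORD level
    if semantic_enabled:
        stages.append("semantic")                           # gated on an enable flag
    return stages
```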
Performance Characteristics
Time Complexity
| Algorithm | Best | Average | Worst |
|---|---|---|---|
| Normalization | O(n) | O(n) | O(n) |
| Segmentation | O(n) | O(n) | O(n) |
| Syllable Validation | O(1) | O(1) | O(k)* |
| SymSpell Lookup | O(1) | O(1) | O(1) |
| Viterbi POS | O(nT²) | O(nT²) | O(nT²) |
| N-gram Lookup | O(1) | O(1) | O(1) |
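The O(nT²) cost of Viterbi comes from considering every previous tag for every current tag at each of the n positions. A minimal sketch (the real tagger's interface will differ; the `min_prob` floor mirrors the 1e-10 default in the parameter table below):

```python
import math

def viterbi(obs, tags, start_p, trans_p, emit_p, min_prob=1e-10):
    """Most likely tag sequence for obs; the nested tag loops make it O(n*T^2)."""
    # V[i][t] = (best log-prob of a path ending in tag t at position i, backpointer)
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[t].get(obs[0], min_prob)), None)
          for t in tags}]
    for i in range(1, len(obs)):
        row = {}
        for t in tags:  # T current tags, each scanning T predecessors
            best = max(tags, key=lambda p: V[i - 1][p][0] + math.log(trans_p[p][t]))
            row[t] = (V[i - 1][best][0] + math.log(trans_p[best][t])
                      + math.log(emit_p[t].get(obs[i], min_prob)), best)
        V.append(row)
    last = max(tags, key=lambda t: V[-1][t][0])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):  # follow backpointers
        path.append(V[i][path[-1]][1])
    return list(reversed(path))
```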
Space Complexity
| Algorithm | Memory | Notes |
|---|---|---|
| SymSpell | ~100MB | Pre-computed deletes |
| N-gram Model | ~50MB | Indexed probabilities |
| Viterbi | O(nT) | Dynamic programming table |
| Semantic | ~300MB | ONNX model |
Implementation Notes
Cython Optimizations
Performance-critical algorithms are implemented in Cython.

Parallel Processing
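A hypothetical sketch of how a Cython kernel can combine both points: typed memoryviews for C-speed loops, and `prange` (which releases the GIL and maps iterations onto OpenMP threads) for batch parallelism. Names and the toy computation are illustrative only; the real kernels live in the library's .pyx sources:

```cython
# cython: boundscheck=False
# distutils: extra_compile_args = -fopenmp
# distutils: extra_link_args = -fopenmp
from cython.parallel import prange

def batch_length_gap(int[:] lens_a, int[:] lens_b, int[:] out):
    """Toy batch kernel: |len(a) - len(b)|, a cheap lower bound on edit distance.
    prange runs the loop without the GIL across OpenMP threads."""
    cdef Py_ssize_t i
    for i in prange(lens_a.shape[0], nogil=True):
        out[i] = lens_a[i] - lens_b[i] if lens_a[i] >= lens_b[i] else lens_b[i] - lens_a[i]
```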
Batch operations use OpenMP for parallelization.

Quick Reference
Algorithm Parameters
| Algorithm | Key Parameter | Default | Range |
|---|---|---|---|
| SymSpell | max_edit_distance | 2 | 1-3 |
| SymSpell | prefix_length | 10 | 4-10 |
| N-gram | bigram_threshold | 0.0001 | 0-1 |
| Viterbi | min_prob | 1e-10 | 1e-15 to 1e-5 |
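The table's defaults and ranges can be expressed as a validated config object. The class name and validation are a sketch, not necessarily the library's real API:

```python
from dataclasses import dataclass

@dataclass
class SymSpellConfig:
    """Mirrors the SymSpell rows of the parameter table above (illustrative)."""
    max_edit_distance: int = 2  # allowed range 1-3
    prefix_length: int = 10     # allowed range 4-10

    def __post_init__(self):
        if not 1 <= self.max_edit_distance <= 3:
            raise ValueError("max_edit_distance must be in 1-3")
        if not 4 <= self.prefix_length <= 10:
            raise ValueError("prefix_length must be in 4-10")
```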
Tuning Guidelines
- Speed priority: Use edit distance 1, disable context
- Accuracy priority: Use edit distance 2, enable context
- Memory constrained: Use SQLite provider, disable semantic
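The guidelines above map naturally onto named presets; the preset keys below are illustrative, not the library's actual setting names:

```python
# Assumed setting names, mapping each tuning guideline to concrete values.
TUNING_PRESETS = {
    "speed":      {"max_edit_distance": 1, "context_enabled": False},
    "accuracy":   {"max_edit_distance": 2, "context_enabled": True},
    "low_memory": {"dictionary_provider": "sqlite", "semantic_enabled": False},
}
```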
See Also
- Architecture Overview - System design
- Performance Tuning - Optimization guide
- API Reference - Programmatic access