Two AI approaches: mySpellChecker offers two AI-powered strategies. Semantic Validation is more thorough; Error Detection is faster. They can be used together.
- Semantic Validation (this page): MLM-based, masks each word, provides suggestions (~200ms)
- Error Detection: Token classification, single forward pass, detects errors only (~10ms)
How It Works
mySpellChecker uses a Masked Language Model (MLM) approach, similar to BERT or RoBERTa.
- Masking: The system takes a sentence and hides the suspicious word.
- Sentence: “မောင်မောင်က အောင်အောင်ကို လှမ်းပျော်လိုက်သည်။” (Maung Maung [happy] Aung Aung).
- Masked: “မောင်မောင်က အောင်အောင်ကို [MASK]လိုက်သည်။”
- Prediction: The AI model predicts the most likely words to fill the hole based on the entire sentence context.
- Predictions: “ပြော” (Say - 99%), “ကြည့်” (Look - 0.5%)…
- Comparison:
- The original word “ပျော်” (Happy) is contextually nonsense here (very low probability).
- A phonetically similar neighbor “ပြော” (Say) has high probability.
- The system flags this as a semantic error and suggests “ပြော”.
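The mask-predict-compare loop above can be sketched in a few lines. This is a minimal illustration of the decision logic only: the probabilities, the threshold value, and the candidate list are made up for the example and are not the library's actual values.

```python
def flag_semantic_error(original_word, predictions, threshold=0.01):
    """Flag `original_word` as a likely semantic error when the MLM assigns it
    a very low probability at the masked position, and suggest the
    highest-probability alternative.

    `predictions` maps candidate fill-in words to their probabilities
    (illustrative values, not real model output).
    """
    original_prob = predictions.get(original_word, 0.0)
    if original_prob >= threshold:
        return None  # the original word fits the context; nothing to flag
    # Suggest the most likely fill-in for the masked slot
    best_word = max(predictions, key=predictions.get)
    return {
        "error": original_word,
        "suggestion": best_word,
        "confidence": predictions[best_word],
    }

# Worked example mirroring the sentence above (probabilities illustrative):
preds = {"ပြော": 0.99, "ကြည့်": 0.005, "ပျော်": 0.0001}
result = flag_semantic_error("ပျော်", preds)
# "ပျော်" (Happy) scores far below the threshold, so "ပြော" (Say) is suggested.
```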
Architecture
- Model Format: ONNX (Open Neural Network Exchange) for high-performance inference on CPU.
- Tokenizer: HFTokenizerWrapper (adapts HuggingFace tokenizers like XLM-RoBERTa, mBERT) or custom tokenizer.json (via tokenizers library).
- Optimization: The model is quantized (int8) to reduce size and increase speed.
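To make the int8 quantization step concrete, here is a simplified sketch of symmetric per-tensor quantization. This is not the exact scheme the ONNX quantizer applies (which also handles zero-points, per-channel scales, and activation calibration); it only shows why storing int8 values plus one float scale shrinks the model roughly 4x versus float32 with small reconstruction error.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats in [-max|w|, +max|w|]
    onto integers in [-127, 127], keeping one float scale per tensor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Reconstruct approximate float weights from int8 values."""
    return [v * scale for v in quantized]

# Toy weight tensor (illustrative values)
weights = [0.53, -1.27, 0.002, 0.91]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each restored weight differs from the original by at most scale / 2.
```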
Training Your Own Model
Since generic models may not cover your specific domain (e.g., medical, legal), mySpellChecker provides a built-in training pipeline. You can train a custom model on your own text corpus without needing a GPU cluster or cloud API.
Train
Use the train-model CLI command. This handles tokenization, training (RoBERTa), and ONNX export automatically.
Usage
Prerequisites
Configuration
You can load the model using file paths or pass pre-loaded objects.
Option A: File Paths (Simple)
Performance Considerations
- Latency: Neural network inference is slower than N-gram lookup (~50ms - 150ms on CPU).
- Strategy: Use Semantic Validation when accuracy is paramount (e.g., final proofreading, offline batch processing).
Related Examples
While we don’t have a standalone Semantic Model demo (as it requires training a model first), the Context Aware Demo illustrates the principles of context checking. To adapt that example for Semantic Validation:
- Train your model using myspellchecker train-model.
- Update the SpellCheckerConfig in the example script to include a SemanticConfig with model_path and tokenizer_path.
- The check() call remains exactly the same (level="word"), but the results will now include AI-powered suggestions!
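Putting those steps together, a configuration sketch might look like the following. The import path, constructor signatures, and file names here are assumptions for illustration, not the documented API; only the SpellCheckerConfig, SemanticConfig, model_path, and tokenizer_path names and the level="word" argument come from this page.

```python
# Illustrative only: import path and signatures are assumed, not documented.
from myspellchecker import SpellChecker, SpellCheckerConfig, SemanticConfig

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="models/my_domain/model.onnx",        # quantized ONNX model (hypothetical path)
        tokenizer_path="models/my_domain/tokenizer.json", # tokenizer file (hypothetical path)
    )
)

checker = SpellChecker(config)
# Same call as the Context Aware Demo; results now carry semantic suggestions.
results = checker.check("မောင်မောင်က အောင်အောင်ကို လှမ်းပျော်လိုက်သည်။", level="word")
```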
See Also
- Error Detection - Faster AI alternative using token classification (~10ms)
- Training Guide - Training both MLM and error detection models
- Semantic Checking Feature - Feature-level documentation