Overview
mySpellChecker provides two training pipelines:

1. Semantic Model (MLM) Training

Trains a custom Masked Language Model for semantic validation:

| Stage | Output | Purpose |
|---|---|---|
| Tokenizer | tokenizer.json | Byte-Level BPE tokenizer for Myanmar |
| Model Training | PyTorch checkpoint | Masked Language Model |
| ONNX Export | model.onnx | Optimized inference model |
2. Error Detection Training

Fine-tunes XLM-RoBERTa for token classification (error detection):

| Stage | Output | Purpose |
|---|---|---|
| Preprocess | Cleaned text | Zawgyi conversion, normalization, filtering |
| Synthetic Errors | Corrupted + labels | Training data from YAML rules |
| Fine-tuning | PyTorch checkpoint | Token classifier (CORRECT/ERROR) |
| ONNX Export | model.onnx | Optimized inference model |
Prerequisites
Install the training dependencies:

- torch - PyTorch for model training
- transformers - HuggingFace Transformers for model architectures
- tokenizers - Fast tokenizer library
- onnx - ONNX export support
- onnxruntime - ONNX inference runtime
Quick Start
The simplest way to train a model:

Model Architectures

The training pipeline supports two transformer architectures:

RoBERTa (Default)

RoBERTa (Robustly Optimized BERT Pretraining Approach) is recommended for most use cases:

- Dynamic masking during training
- No Next Sentence Prediction (NSP) objective
- Larger batch sizes and more training data typically improve results
BERT
BERT (Bidirectional Encoder Representations from Transformers):

- Static masking
- Includes NSP objective capability
- Well-suited for tasks requiring sentence-pair understanding
Configuration Options
TrainingConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_file | str | Required | Path to training corpus (one sentence per line) |
| output_dir | str | Required | Directory to save model and artifacts |
| vocab_size | int | 30,000 | Vocabulary size for BPE tokenizer |
| min_frequency | int | 2 | Minimum frequency for token inclusion |
| epochs | int | 5 | Number of training epochs |
| batch_size | int | 16 | Batch size per device |
| learning_rate | float | 5e-5 | Peak learning rate |
| hidden_size | int | 256 | Size of hidden layers |
| num_layers | int | 4 | Number of transformer layers |
| num_heads | int | 4 | Number of attention heads |
| max_length | int | 128 | Maximum sequence length |
| architecture | str | "roberta" | Model architecture ("roberta" or "bert") |
| resume_from_checkpoint | str | None | Path to checkpoint directory to resume from |
| warmup_ratio | float | 0.1 | Ratio of steps for learning rate warmup |
| weight_decay | float | 0.01 | Weight decay for optimizer |
| save_metrics | bool | True | Save training metrics to JSON file |
| keep_checkpoints | bool | False | Keep intermediate checkpoints |
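The parameter surface above can be mirrored as a plain dataclass. This is an illustrative sketch that copies the documented names and defaults; the library's actual TrainingConfig class may be defined differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingConfig:
    """Illustrative mirror of the documented parameters (sketch only)."""
    input_file: str                # required: corpus path
    output_dir: str                # required: artifact directory
    vocab_size: int = 30000
    min_frequency: int = 2
    epochs: int = 5
    batch_size: int = 16
    learning_rate: float = 5e-5
    hidden_size: int = 256
    num_layers: int = 4
    num_heads: int = 4
    max_length: int = 128
    architecture: str = "roberta"
    resume_from_checkpoint: Optional[str] = None
    warmup_ratio: float = 0.1
    weight_decay: float = 0.01
    save_metrics: bool = True
    keep_checkpoints: bool = False

cfg = TrainingConfig(input_file="corpus.txt", output_dir="out/")
```

Only input_file and output_dir must be supplied; everything else falls back to the defaults in the table.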
Architecture Constraints
The hidden_size must be divisible by num_heads. Valid combinations include:
- hidden_size=256, num_heads=4 (64 per head)
- hidden_size=256, num_heads=8 (32 per head)
- hidden_size=512, num_heads=8 (64 per head)
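The constraint is easy to check up front. A small helper (written for this guide, not part of the library) that returns the per-head dimension or rejects an invalid split:

```python
def validate_attention_config(hidden_size: int, num_heads: int) -> int:
    """Return the per-head dimension, raising if hidden_size does not
    split evenly across attention heads."""
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by num_heads ({num_heads})"
        )
    return hidden_size // num_heads

print(validate_attention_config(256, 4))  # 64 dimensions per head
```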
Learning Rate Scheduling
The training pipeline uses linear learning rate scheduling with warmup:

- Starts at 0
- Linearly increases to learning_rate over warmup_ratio * total_steps
- Linearly decreases to 0 over the remaining steps
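The schedule above can be sketched as a pure function of the step index. This is an illustration of the shape only; in practice HuggingFace's get_linear_schedule_with_warmup computes the same curve.

```python
def linear_warmup_lr(step: int, total_steps: int,
                     peak_lr: float = 5e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given step: linear warmup to peak_lr, then
    linear decay to 0 (sketch of the documented schedule)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)              # warmup phase
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / max(1, remaining))  # decay phase
```

With total_steps=100 and the defaults, the rate climbs from 0 to 5e-5 over the first 10 steps, then decays back to 0 over the remaining 90.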
Resume Training from Checkpoint
Training can be resumed from a checkpoint if interrupted:

Training Metrics

When save_metrics=True (default), training metrics are saved to training_metrics.json:

- step: Global training step
- epoch: Current epoch (fractional)
- loss: Training loss
- learning_rate: Current learning rate
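The metrics file is plain JSON, so monitoring it needs nothing beyond the standard library. The records below are invented for illustration; only the field names come from the list above, and the file is assumed to contain a JSON array of such records.

```python
import json

# Illustrative records in the documented shape (values invented for the example).
sample = """[
  {"step": 50,  "epoch": 0.5, "loss": 6.91, "learning_rate": 2.5e-05},
  {"step": 100, "epoch": 1.0, "loss": 5.48, "learning_rate": 5e-05},
  {"step": 150, "epoch": 1.5, "loss": 4.92, "learning_rate": 4.4e-05}
]"""

records = json.loads(sample)           # for a real run: json.load(open("training_metrics.json"))
best = min(records, key=lambda r: r["loss"])
print(f"lowest loss {best['loss']} at step {best['step']} (epoch {best['epoch']})")
```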
Low-Level API
For more control, use ModelTrainer directly:
ONNX Export
Models are automatically exported to ONNX format with INT8 quantization. The exported model can be loaded by SemanticChecker for context-aware validation.

Output Files:

- model.onnx - Quantized model (default)
- model.base.onnx - Original FP32 model
- tokenizer.json - Copied for convenience
Using Trained Models
With SemanticChecker
Standalone Inference
CLI Usage
Train a model via CLI:

Corpus Format

The training corpus should be a text file with one sentence per line:

- UTF-8 encoding
- One sentence per line
- Minimum 100 lines (recommended: 10,000+ lines)
- Segmented text (spaces between words) works best
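A quick sanity check against those requirements can be done before training. This helper is written for this guide (not part of the library) and takes any iterable of lines, e.g. an open file handle:

```python
from typing import Iterable

def check_corpus(lines: Iterable[str], min_lines: int = 100) -> int:
    """Count non-empty sentences and flag corpora below the documented
    minimum of 100 lines (10,000+ recommended)."""
    count = sum(1 for ln in lines if ln.strip())
    if count < min_lines:
        print(f"warning: only {count} sentences; at least {min_lines} "
              f"required, 10,000+ recommended")
    return count

# Usage with a real corpus file:
# with open("corpus.txt", encoding="utf-8") as fh:
#     check_corpus(fh)
```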
GPU Support
Training automatically uses GPU if available:

Batch Size by GPU VRAM
| GPU VRAM | Recommended batch_size |
|---|---|
| 4GB | 8 |
| 8GB | 16 |
| 16GB | 32 |
| 24GB+ | 64 |
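The table above maps directly to a small lookup helper (illustrative only, not a library function):

```python
def suggested_batch_size(vram_gb: float) -> int:
    """Pick a batch_size from the VRAM table above."""
    for threshold, batch in ((24, 64), (16, 32), (8, 16)):
        if vram_gb >= threshold:
            return batch
    return 8  # 4GB-class GPUs
```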
Model Size vs Quality
| Configuration | Parameters | Quality | Speed |
|---|---|---|---|
| Small (default) | ~5M | Good | Fast |
| Medium | ~20M | Better | Medium |
| Large | ~100M | Best | Slow |
Best Practices
- Corpus Size: Use at least 10,000 sentences for meaningful results
- Batch Size: Larger batches (16-32) generally train faster on GPU
- Hidden Size: Start with 256 for small models, 512 for larger ones
- Epochs: 5-10 epochs is usually sufficient; monitor loss for overfitting
- Warmup: 10% warmup (0.1) helps training stability
- Checkpoints: Enable keep_checkpoints=True for long training runs
- Metrics: Always save metrics to monitor training progress
Troubleshooting
Memory Issues
Slow Training
Invalid hidden_size/num_heads
Error Detection Training
Overview
The error detection pipeline trains a token classification model that detects errors in a single forward pass (~10ms). Unlike the MLM pipeline, which trains from scratch, this pipeline fine-tunes a pre-trained XLM-RoBERTa model. Training uses only clean data — the library generates synthetic errors from its existing YAML rules (homophones, typos, phonetic data).

Quick Start
CLI Usage
ErrorDetectionConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_file | str | Required | Path to clean corpus (one sentence per line) |
| output_dir | str | Required | Directory to save model artifacts |
| base_model | str | xlm-roberta-base | Pre-trained model to fine-tune |
| epochs | int | 3 | Number of training epochs |
| batch_size | int | 16 | Batch size per device |
| learning_rate | float | 2e-5 | Peak learning rate |
| corruption_ratio | float | 0.15 | Fraction of words to corrupt per sentence |
| max_length | int | 256 | Maximum sequence length |
| keep_checkpoints | bool | False | Keep intermediate checkpoints |
| save_metrics | bool | True | Save training metrics to JSON |
| seed | int | None | Random seed for reproducibility |
| skip_preprocessing | bool | False | Skip corpus preprocessing |
Corpus Preprocessing
By default, the pipeline preprocesses the corpus before generating synthetic errors:

- Zawgyi detection and conversion — Converts Zawgyi-encoded text to Unicode
- Unicode normalization — NFC normalization, diacritic reordering, zero-width removal
- Quality filtering — Removes non-Myanmar text, too-short/too-long lines
Use --skip-preprocessing only if your corpus is already clean Unicode.
Synthetic Error Generation
The SyntheticErrorGenerator creates training data by corrupting clean sentences:
| Corruption Type | Weight | Source |
|---|---|---|
| Homophone swap | 30% | rules/homophones.yaml |
| Medial confusion (ျ↔ြ, ွ↔ှ) | 20% | Myanmar character sets |
| Similar char swap | 15% | phonetic_data.py VISUAL_SIMILAR |
| Character deletion | 15% | Asat, tone marks, vowels |
| Character insertion | 10% | Double stacking, extra tone |
| Typo pattern | 10% | rules/typo_corrections.yaml (inverted) |
Corrupted words are labeled ERROR (1); unchanged words are labeled CORRECT (0).
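The weighted selection step can be sketched with random.choices. The weights are taken from the table above; the function is an illustration of the selection only, and the actual corruption logic lives in generator.py.

```python
import random

# Corruption types and weights from the table above.
CORRUPTION_WEIGHTS = {
    "homophone_swap": 0.30,
    "medial_confusion": 0.20,
    "similar_char_swap": 0.15,
    "char_deletion": 0.15,
    "char_insertion": 0.10,
    "typo_pattern": 0.10,
}

def pick_corruption(rng: random.Random) -> str:
    """Sample one corruption type for a word according to the
    documented weights (sketch of the selection step only)."""
    types = list(CORRUPTION_WEIGHTS)
    weights = list(CORRUPTION_WEIGHTS.values())
    return rng.choices(types, weights=weights, k=1)[0]
```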
Pipeline Steps
The full pipeline runs 6 steps:

1. Load Corpus — Read input file
2. Preprocess — Zawgyi conversion, normalization, filtering (skippable)
3. Generate Synthetic Errors — Corrupt words using YAML rules
4. Build Dataset — Tokenize with subword alignment, create labels
5. Train Token Classifier — Fine-tune XLM-RoBERTa
6. Export to ONNX — Quantized model for inference
Module Files
The error detection training pipeline is implemented in src/myspellchecker/training/error_detection/:
| File | Description |
|---|---|
| constants.py | Labels (CORRECT/ERROR), corruption weights, default hyperparameters |
| generator.py | SyntheticErrorGenerator — rule-based corruption using YAML rules |
| dataset.py | ErrorDetectionDataset — torch Dataset with subword tokenization and label alignment |
| iterable_dataset.py | ErrorDetectionIterableDataset — streaming IterableDataset that reads pre-generated JSONL examples line-by-line, avoiding loading the entire corpus into RAM |
| alignment.py | Token-label alignment utilities — maps word-level labels to subword tokens using binary search (O(T log W)) |
| trainer.py | ErrorDetectionTrainer — fine-tunes XLM-RoBERTa for token classification |
| pipeline.py | ErrorDetectionPipeline + ErrorDetectionConfig — end-to-end orchestrator |
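The binary-search alignment strategy that alignment.py uses can be sketched as follows. Each subword token receives the label of the word whose character span contains the token's start offset; special tokens (whose offset span is empty) get -100 so the loss function ignores them. The function signature here is an assumption for illustration, not the library's actual API.

```python
from bisect import bisect_right

def align_labels(word_starts, word_labels, token_offsets):
    """Map word-level labels to subword tokens via binary search over
    word start positions (the O(T log W) strategy described above).
    token_offsets are (start, end) character spans from the tokenizer."""
    labels = []
    for start, end in token_offsets:
        if start == end:                          # [CLS]/[SEP]/padding
            labels.append(-100)
            continue
        word_idx = bisect_right(word_starts, start) - 1
        labels.append(word_labels[word_idx])
    return labels
```

For a two-word sentence where the second word is corrupted (labels [0, 1]) and the tokenizer emits a special token, one subword per word, then another special token, the output is [-100, 0, 1, -100].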
Using the Trained Model
Comparing Training Pipelines
| Aspect | train-model (MLM) | train-detector (Token Classifier) |
|---|---|---|
| Purpose | Semantic validation | Error detection |
| Base model | Train from scratch | Fine-tune XLM-RoBERTa |
| Training data | Raw corpus | Clean corpus + synthetic errors |
| Preprocessing | None (user responsibility) | Built-in (Zawgyi, normalization) |
| Inference | N forward passes per sentence | Single forward pass |
| Speed | ~200ms per sentence | ~10ms per sentence |
| Output | Suggestions + error detection | Error detection only |
| CLI | train-model | train-detector |
See Also
- Error Detection - Error detection feature overview
- Semantic Checking - Using trained models for context validation
- CLI Reference - train-model and train-detector command details
- Configuration Guide - SemanticConfig options