mySpellChecker’s AI features (semantic checking and error detection) require models that you train on your own corpus. The library does not ship with pre-trained models — instead it provides two complete training pipelines that handle tokenizer creation, model training, and ONNX export.

Overview

mySpellChecker provides two training pipelines:

1. Semantic Model (MLM) Training

Trains a custom Masked Language Model for semantic validation:
Raw Text → Tokenizer Training → Model Training → ONNX Export
| Stage | Output | Purpose |
| --- | --- | --- |
| Tokenizer | tokenizer.json | Byte-Level BPE tokenizer for Myanmar |
| Model Training | PyTorch checkpoint | Masked Language Model |
| ONNX Export | model.onnx | Optimized inference model |

2. Error Detection Training

Fine-tunes XLM-RoBERTa for token classification (error detection):
Clean Corpus → Preprocess → Synthetic Errors → Fine-tune XLM-R → ONNX Export
| Stage | Output | Purpose |
| --- | --- | --- |
| Preprocess | Cleaned text | Zawgyi conversion, normalization, filtering |
| Synthetic Errors | Corrupted text + labels | Training data from YAML rules |
| Fine-tuning | PyTorch checkpoint | Token classifier (CORRECT/ERROR) |
| ONNX Export | model.onnx | Optimized inference model |

Prerequisites

Install the training dependencies:
pip install myspellchecker[train]
This installs:
  • torch - PyTorch for model training
  • transformers - HuggingFace Transformers for model architectures
  • tokenizers - Fast tokenizer library
  • onnx - ONNX export support
  • onnxruntime - ONNX inference runtime
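To confirm the extras installed correctly, a quick import check can help (plain Python, not a library command):

```python
# Sanity check: verify each training dependency is importable.
import importlib.util

for pkg in ["torch", "transformers", "tokenizers", "onnx", "onnxruntime"]:
    spec = importlib.util.find_spec(pkg)
    print(f"{pkg}: {'OK' if spec else 'MISSING'}")
```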

Quick Start

The simplest way to train a model:
from myspellchecker.training import TrainingPipeline, TrainingConfig

# Configure training
config = TrainingConfig(
    input_file="corpus.txt",  # One sentence per line
    output_dir="./my_model",
    epochs=5,
)

# Run training
pipeline = TrainingPipeline()
model_path = pipeline.run(config)
print(f"Model saved to: {model_path}")

Model Architectures

The training pipeline supports two transformer architectures:

RoBERTa (Default)

RoBERTa (Robustly Optimized BERT Pretraining Approach) is recommended for most use cases:
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./roberta_model",
    architecture="roberta",  # Default
)
Key characteristics:
  • Dynamic masking during training
  • No Next Sentence Prediction (NSP) objective
  • Larger batch sizes and more training data typically improve results

BERT

BERT (Bidirectional Encoder Representations from Transformers):
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./bert_model",
    architecture="bert",
)
Key characteristics:
  • Static masking
  • Includes NSP objective capability
  • Well-suited for tasks requiring sentence-pair understanding

Configuration Options

TrainingConfig Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input_file | str | Required | Path to training corpus (one sentence per line) |
| output_dir | str | Required | Directory to save model and artifacts |
| vocab_size | int | 30000 | Vocabulary size for BPE tokenizer |
| min_frequency | int | 2 | Minimum frequency for token inclusion |
| epochs | int | 5 | Number of training epochs |
| batch_size | int | 16 | Batch size per device |
| learning_rate | float | 5e-5 | Peak learning rate |
| hidden_size | int | 256 | Size of hidden layers |
| num_layers | int | 4 | Number of transformer layers |
| num_heads | int | 4 | Number of attention heads |
| max_length | int | 128 | Maximum sequence length |
| architecture | str | "roberta" | Model architecture ("roberta" or "bert") |
| resume_from_checkpoint | str | None | Path to checkpoint directory to resume from |
| warmup_ratio | float | 0.1 | Ratio of steps for learning rate warmup |
| weight_decay | float | 0.01 | Weight decay for optimizer |
| save_metrics | bool | True | Save training metrics to JSON file |
| keep_checkpoints | bool | False | Keep intermediate checkpoints |

Architecture Constraints

The hidden_size must be divisible by num_heads. Valid combinations include:
  • hidden_size=256, num_heads=4 (64 per head)
  • hidden_size=256, num_heads=8 (32 per head)
  • hidden_size=512, num_heads=8 (64 per head)
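The constraint can be checked up front with plain arithmetic. Note that `head_dim` is an illustrative helper, not a library function:

```python
# Each attention head gets hidden_size / num_heads dimensions, so the
# division must be exact.
def head_dim(hidden_size: int, num_heads: int) -> int:
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    return hidden_size // num_heads

print(head_dim(256, 4))  # 64
print(head_dim(512, 8))  # 64
```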

Learning Rate Scheduling

The training pipeline uses linear learning rate scheduling with warmup:
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    learning_rate=5e-5,     # Peak learning rate
    warmup_ratio=0.1,       # 10% of steps for warmup
    weight_decay=0.01,      # AdamW weight decay
)
The learning rate:
  1. Starts at 0
  2. Linearly increases to learning_rate over warmup_ratio * total_steps
  3. Linearly decreases to 0 over remaining steps
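The three phases above can be reproduced numerically. `linear_schedule` is a standalone sketch of the schedule, not the library's internal implementation:

```python
# Linear warmup to the peak learning rate, then linear decay to zero.
def linear_schedule(step, total_steps, peak_lr=5e-5, warmup_ratio=0.1):
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Phase 1-2: ramp from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Phase 3: decay from peak_lr back to 0
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 1000
for s in (0, 50, 100, 550, 1000):
    print(s, linear_schedule(s, total))
```

With 1000 total steps and `warmup_ratio=0.1`, the peak is reached at step 100 and the rate falls back to zero at the final step.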

Resume Training from Checkpoint

Training can be resumed from a checkpoint if interrupted:
# Initial training
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    epochs=10,
    keep_checkpoints=True,  # Keep checkpoints for resume
)
pipeline = TrainingPipeline()
pipeline.run(config)  # Interrupted at epoch 5

# Resume training
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    epochs=10,
    resume_from_checkpoint="./model/checkpoints/checkpoint-500",
)
pipeline.run(config)  # Continues from checkpoint
Checkpoints are saved every 500 steps by default.

Training Metrics

When save_metrics=True (default), training metrics are saved to training_metrics.json:
[
  {"step": 50, "epoch": 0.5, "loss": 8.234, "learning_rate": 2.5e-5},
  {"step": 100, "epoch": 1.0, "loss": 6.891, "learning_rate": 5e-5},
  ...
]
Metrics include:
  • step: Global training step
  • epoch: Current epoch (fractional)
  • loss: Training loss
  • learning_rate: Current learning rate
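Once training finishes, the metrics file can be inspected with ordinary JSON tooling. `summarize_metrics` below is a hypothetical analysis helper, not a library function:

```python
# Load training_metrics.json (written when save_metrics=True) and
# summarize how the loss evolved over the run.
import json
from pathlib import Path

def summarize_metrics(path):
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    first, last = records[0], records[-1]
    return {
        "steps": last["step"],
        "initial_loss": first["loss"],
        "final_loss": last["loss"],
        "loss_drop": round(first["loss"] - last["loss"], 3),
    }
```

A steadily shrinking `loss` across records is the main signal that training is progressing; a plateau suggests more data or epochs are needed.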

Low-Level API

For more control, use ModelTrainer directly:
from myspellchecker.training import ModelTrainer, ModelArchitecture

trainer = ModelTrainer()

# Step 1: Train tokenizer
tokenizer_path = trainer.train_tokenizer(
    corpus_path="corpus.txt",
    output_dir="./tokenizer",
    vocab_size=30000,
)

# Step 2: Train model
model_path = trainer.train_model(
    corpus_path="corpus.txt",
    tokenizer_path=tokenizer_path,
    output_dir="./model",
    architecture=ModelArchitecture.ROBERTA,
    epochs=5,
    warmup_ratio=0.1,
    save_metrics=True,
)

ONNX Export

Models are automatically exported to ONNX format with INT8 quantization:
from myspellchecker.training import ONNXExporter

exporter = ONNXExporter()
exporter.export(
    model_dir="./pytorch_model",
    output_dir="./onnx_model",
    quantize=True,  # INT8 quantization
)
The exported ONNX model can be used with SemanticChecker for context-aware validation.

Output files:
  • model.onnx - Quantized model (default)
  • model.base.onnx - Original FP32 model
  • tokenizer.json - Copied for convenience

Using Trained Models

With SemanticChecker

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, SemanticConfig

config = SpellCheckerConfig(
    semantic=SemanticConfig(
        model_path="./models/model.onnx",
        tokenizer_path="./models/tokenizer.json",
    )
)

checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")

Standalone Inference

import onnxruntime as ort
from transformers import PreTrainedTokenizerFast

# Load model and tokenizer
session = ort.InferenceSession("./models/model.onnx")
tokenizer = PreTrainedTokenizerFast(tokenizer_file="./models/tokenizer.json")

# Prepare input
text = "မြန်မာ<mask>သည်"
inputs = tokenizer(text, return_tensors="np")

# Run inference
outputs = session.run(
    ["logits"],
    {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
    }
)

CLI Usage

Train a model via CLI:
# Basic training
myspellchecker train-model -i corpus.txt -o ./models/

# With custom parameters
myspellchecker train-model -i corpus.txt -o ./models/ \
  --architecture roberta \
  --epochs 10 \
  --hidden-size 512 \
  --layers 6 \
  --heads 8 \
  --learning-rate 3e-5

# Resume from checkpoint
myspellchecker train-model -i corpus.txt -o ./models/ \
  --resume ./models/checkpoints/checkpoint-500

Corpus Format

The training corpus should be a text file with one sentence per line:
ကျွန်တော် မြန်မာ စာ လေ့လာ နေ ပါ တယ်
သူမ က စာအုပ် ဖတ် နေ တယ်
ဒီ နေ့ ရာသီ ဥတု ကောင်း တယ်
Requirements:
  • UTF-8 encoding
  • One sentence per line
  • Minimum 100 lines (recommended: 10,000+ lines)
  • Segmented text (spaces between words) works best
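A quick pre-flight check against these requirements might look like the following (`validate_corpus` is a hypothetical helper, and the threshold mirrors the minimum above):

```python
# Count non-empty lines and enforce the recommended minimum corpus size.
from pathlib import Path

def validate_corpus(path, min_lines=100):
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    sentences = [ln for ln in lines if ln.strip()]
    if len(sentences) < min_lines:
        raise ValueError(
            f"corpus has {len(sentences)} sentences; need >= {min_lines}"
        )
    return len(sentences)
```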

GPU Support

Training automatically uses GPU if available:
import torch
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Batch Size by GPU VRAM

| GPU VRAM | Recommended batch_size |
| --- | --- |
| 4 GB | 8 |
| 8 GB | 16 |
| 16 GB | 32 |
| 24 GB+ | 64 |
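These recommendations can be expressed as a small helper; the thresholds are copied from the table and the function name is illustrative:

```python
# Map available VRAM (GB) to the recommended batch size from the table.
def recommend_batch_size(vram_gb: float) -> int:
    if vram_gb >= 24:
        return 64
    if vram_gb >= 16:
        return 32
    if vram_gb >= 8:
        return 16
    return 8

print(recommend_batch_size(8))  # 16
```

On an NVIDIA GPU the total VRAM can be read via `torch.cuda.get_device_properties(0).total_memory` (in bytes).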
For CPU-only training:
# Training falls back to CPU automatically if no GPU is available
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    batch_size=8,  # Reduce batch size for CPU
)

Model Size vs Quality

| Configuration | Parameters | Quality | Speed |
| --- | --- | --- | --- |
| Small (default) | ~5M | Good | Fast |
| Medium | ~20M | Better | Medium |
| Large | ~100M | Best | Slow |
# Small (default)
config = TrainingConfig(hidden_size=256, num_layers=4, num_heads=4)

# Medium
config = TrainingConfig(hidden_size=512, num_layers=6, num_heads=8)

# Large
config = TrainingConfig(hidden_size=768, num_layers=12, num_heads=12)

Best Practices

  1. Corpus Size: Use at least 10,000 sentences for meaningful results
  2. Batch Size: Larger batches (16-32) generally train faster on GPU
  3. Hidden Size: Start with 256 for small models, 512 for larger ones
  4. Epochs: 5-10 epochs is usually sufficient; monitor loss for overfitting
  5. Warmup: 10% warmup (0.1) helps training stability
  6. Checkpoints: Enable keep_checkpoints=True for long training runs
  7. Metrics: Always save metrics to monitor training progress

Troubleshooting

Memory Issues

# Reduce batch size and max_length
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    batch_size=4,
    max_length=64,
)

Slow Training

# Check GPU availability
import torch
print(torch.cuda.is_available())

# Reduce model complexity
config = TrainingConfig(
    input_file="corpus.txt",
    output_dir="./model",
    hidden_size=128,
    num_layers=2,
)

Invalid hidden_size/num_heads

# hidden_size must be divisible by num_heads
# This will raise ValueError:
config = TrainingConfig(
    hidden_size=256,
    num_heads=3,  # Error: 256 not divisible by 3
)

# Valid configuration:
config = TrainingConfig(
    hidden_size=256,
    num_heads=4,  # OK: 256 / 4 = 64
)

Error Detection Training

Overview

The error detection pipeline trains a token classification model that detects errors in a single forward pass (~10ms). Unlike the MLM pipeline which trains from scratch, this pipeline fine-tunes a pre-trained XLM-RoBERTa model. Training uses only clean data — the library generates synthetic errors from its existing YAML rules (homophones, typos, phonetic data).

Quick Start

from myspellchecker.training.error_detection import (
    ErrorDetectionPipeline,
    ErrorDetectionConfig,
)

config = ErrorDetectionConfig(
    input_file="corpus.txt",  # Clean text, one sentence per line
    output_dir="./detector",
    epochs=3,
)

pipeline = ErrorDetectionPipeline()
model_path = pipeline.run(config)
print(f"Detector saved to: {model_path}")

CLI Usage

# Basic training
myspellchecker train-detector -i corpus.txt -o ./detector/

# With custom parameters
myspellchecker train-detector -i corpus.txt -o ./detector/ \
  --epochs 5 --corruption-ratio 0.10 --batch-size 32

# Skip preprocessing (if corpus is already clean)
myspellchecker train-detector -i corpus.txt -o ./detector/ \
  --skip-preprocessing

ErrorDetectionConfig Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input_file | str | Required | Path to clean corpus (one sentence per line) |
| output_dir | str | Required | Directory to save model artifacts |
| base_model | str | xlm-roberta-base | Pre-trained model to fine-tune |
| epochs | int | 3 | Number of training epochs |
| batch_size | int | 16 | Batch size per device |
| learning_rate | float | 2e-5 | Peak learning rate |
| corruption_ratio | float | 0.15 | Fraction of words to corrupt per sentence |
| max_length | int | 256 | Maximum sequence length |
| keep_checkpoints | bool | False | Keep intermediate checkpoints |
| save_metrics | bool | True | Save training metrics to JSON |
| seed | int | None | Random seed for reproducibility |
| skip_preprocessing | bool | False | Skip corpus preprocessing |

Corpus Preprocessing

By default, the pipeline preprocesses the corpus before generating synthetic errors:
  1. Zawgyi detection and conversion — Converts Zawgyi-encoded text to Unicode
  2. Unicode normalization — NFC normalization, diacritic reordering, zero-width removal
  3. Quality filtering — Removes non-Myanmar text, too-short/too-long lines
This is critical because the corpus defines ground truth labels — any Zawgyi or unnormalized text would be labeled as CORRECT, teaching the model to accept bad text. Use --skip-preprocessing only if your corpus is already clean Unicode.
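As a rough illustration of the quality-filtering step (the library's actual preprocessing is more thorough, and the thresholds here are arbitrary), a line can be kept only when it is mostly Myanmar-block characters (U+1000 to U+109F):

```python
# Keep lines of reasonable length whose characters are mostly Myanmar script.
import re

MYANMAR = re.compile(r"[\u1000-\u109F]")

def keep_line(line, min_len=3, max_len=512, min_ratio=0.5):
    line = line.strip()
    if not (min_len <= len(line) <= max_len):
        return False
    myanmar_chars = len(MYANMAR.findall(line))
    return myanmar_chars / len(line) >= min_ratio

print(keep_line("မြန်မာစာ"))      # True
print(keep_line("hello world"))  # False
```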

Synthetic Error Generation

The SyntheticErrorGenerator creates training data by corrupting clean sentences:
| Corruption Type | Weight | Source |
| --- | --- | --- |
| Homophone swap | 30% | rules/homophones.yaml |
| Medial confusion (ျ↔ြ, ွ↔ှ) | 20% | Myanmar character sets |
| Similar char swap | 15% | phonetic_data.py VISUAL_SIMILAR |
| Character deletion | 15% | Asat, tone marks, vowels |
| Character insertion | 10% | Double stacking, extra tone |
| Typo pattern | 10% | rules/typo_corrections.yaml (inverted) |
Each corrupted word is labeled ERROR (1), unchanged words labeled CORRECT (0).
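The labeling scheme can be sketched with a toy corruptor. The real SyntheticErrorGenerator draws corruptions from the YAML rules above; this stand-in just swaps two adjacent characters:

```python
# Corrupt a fraction of words and label them ERROR (1); untouched words
# are labeled CORRECT (0).
import random

def corrupt_sentence(words, ratio=0.15, rng=None):
    rng = rng or random.Random(0)
    out, labels = [], []
    for w in words:
        if len(w) > 1 and rng.random() < ratio:
            chars = list(w)
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]  # adjacent swap
            out.append("".join(chars))
            labels.append(1)  # ERROR
        else:
            out.append(w)
            labels.append(0)  # CORRECT
    return out, labels
```

The (corrupted words, labels) pairs are exactly the shape of supervision a token classifier needs.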

Pipeline Steps

The full pipeline runs 6 steps:
  1. Load Corpus — Read input file
  2. Preprocess — Zawgyi conversion, normalization, filtering (skippable)
  3. Generate Synthetic Errors — Corrupt words using YAML rules
  4. Build Dataset — Tokenize with subword alignment, create labels
  5. Train Token Classifier — Fine-tune XLM-RoBERTa
  6. Export to ONNX — Quantized model for inference

Module Files

The error detection training pipeline is implemented in src/myspellchecker/training/error_detection/:
| File | Description |
| --- | --- |
| constants.py | Labels (CORRECT/ERROR), corruption weights, default hyperparameters |
| generator.py | SyntheticErrorGenerator: rule-based corruption using YAML rules |
| dataset.py | ErrorDetectionDataset: torch Dataset with subword tokenization and label alignment |
| iterable_dataset.py | ErrorDetectionIterableDataset: streaming IterableDataset that reads pre-generated JSONL examples line by line, avoiding loading the entire corpus into RAM |
| alignment.py | Token-label alignment utilities: maps word-level labels to subword tokens using binary search (O(T log W)) |
| trainer.py | ErrorDetectionTrainer: fine-tunes XLM-RoBERTa for token classification |
| pipeline.py | ErrorDetectionPipeline + ErrorDetectionConfig: end-to-end orchestrator |
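The alignment step can be illustrated with a simplified offset-based sketch. The function and argument names here are illustrative, not the module's actual API:

```python
# Map word-level labels onto subword tokens: each token inherits the label
# of the word whose character span contains the token's start offset.
import bisect

def align_labels(word_spans, word_labels, token_offsets, ignore_index=-100):
    # word_spans: [(start, end)] per word; token_offsets: [(start, end)] per subword
    starts = [s for s, _ in word_spans]
    labels = []
    for t_start, t_end in token_offsets:
        if t_end == t_start:  # special tokens ([CLS], [SEP], padding)
            labels.append(ignore_index)
            continue
        # Binary search for the word containing this token's start offset
        i = bisect.bisect_right(starts, t_start) - 1
        labels.append(word_labels[i])
    return labels
```

Special tokens get the conventional `-100` index so the loss function skips them.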

Using the Trained Model

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.algorithm_configs import ErrorDetectorConfig

config = SpellCheckerConfig(
    error_detector=ErrorDetectorConfig(
        model_path="./detector/onnx/model.onnx",
        tokenizer_path="./detector/onnx",
        confidence_threshold=0.7,
    )
)

checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")

Comparing Training Pipelines

| Aspect | train-model (MLM) | train-detector (Token Classifier) |
| --- | --- | --- |
| Purpose | Semantic validation | Error detection |
| Base model | Trained from scratch | Fine-tunes XLM-RoBERTa |
| Training data | Raw corpus | Clean corpus + synthetic errors |
| Preprocessing | None (user responsibility) | Built-in (Zawgyi, normalization) |
| Inference | N forward passes per sentence | Single forward pass |
| Speed | ~200 ms per sentence | ~10 ms per sentence |
| Output | Suggestions + error detection | Error detection only |
| CLI | train-model | train-detector |

See Also