Overview
mySpellChecker provides two training pipelines:

1. Semantic Model (MLM) Training

Trains a custom Masked Language Model for semantic validation:

| Stage | Output | Purpose |
|---|---|---|
| Tokenizer | tokenizer.json | Byte-Level BPE tokenizer for Myanmar |
| Model Training | PyTorch checkpoint | Masked Language Model |
| ONNX Export | model.onnx | Optimized inference model |
2. Error Detection Training

Fine-tunes XLM-RoBERTa for token classification (error detection):

| Stage | Output | Purpose |
|---|---|---|
| Preprocess | Cleaned text | Zawgyi conversion, normalization, filtering |
| Synthetic Errors | Corrupted + labels | Training data from YAML rules |
| Fine-tuning | PyTorch checkpoint | Token classifier (CORRECT/ERROR) |
| ONNX Export | model.onnx | Optimized inference model |
Prerequisites
Install the training dependencies:

- torch - PyTorch for model training
- transformers - HuggingFace Transformers for model architectures
- tokenizers - Fast tokenizer library
- onnx - ONNX export support
- onnxruntime - ONNX inference runtime
Quick Start
The simplest way to train a model:

Model Architectures

The training pipeline supports two transformer architectures:

RoBERTa (Default)

RoBERTa (Robustly Optimized BERT Pretraining Approach) is recommended for most use cases:

- Dynamic masking during training
- No Next Sentence Prediction (NSP) objective
- Larger batch sizes and more training data typically improve results
BERT
BERT (Bidirectional Encoder Representations from Transformers):

- Static masking
- Includes NSP objective capability
- Well-suited for tasks requiring sentence-pair understanding
Configuration Options
TrainingConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_file | str | Required | Path to training corpus (one sentence per line) |
| output_dir | str | Required | Directory to save model and artifacts |
| vocab_size | int | 30,000 | Vocabulary size for BPE tokenizer |
| min_frequency | int | 2 | Minimum frequency for token inclusion |
| epochs | int | 5 | Number of training epochs |
| batch_size | int | 16 | Batch size per device |
| learning_rate | float | 5e-5 | Peak learning rate |
| hidden_size | int | 256 | Size of hidden layers |
| num_layers | int | 4 | Number of transformer layers |
| num_heads | int | 4 | Number of attention heads |
| max_length | int | 128 | Maximum sequence length |
| architecture | str | "roberta" | Model architecture ("roberta" or "bert") |
| resume_from_checkpoint | str | None | Path to checkpoint directory to resume from |
| warmup_ratio | float | 0.1 | Ratio of steps for learning rate warmup |
| weight_decay | float | 0.01 | Weight decay for optimizer |
| save_metrics | bool | True | Save training metrics to JSON file |
| keep_checkpoints | bool | False | Keep intermediate checkpoints |
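The parameter surface above can be mirrored as a plain dataclass. This is an illustrative sketch that copies the documented names and defaults; the library's actual TrainingConfig class may be defined differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingConfig:
    """Illustrative mirror of the documented parameters (sketch only)."""
    input_file: str                # required: corpus path
    output_dir: str                # required: artifact directory
    vocab_size: int = 30000
    min_frequency: int = 2
    epochs: int = 5
    batch_size: int = 16
    learning_rate: float = 5e-5
    hidden_size: int = 256
    num_layers: int = 4
    num_heads: int = 4
    max_length: int = 128
    architecture: str = "roberta"
    resume_from_checkpoint: Optional[str] = None
    warmup_ratio: float = 0.1
    weight_decay: float = 0.01
    save_metrics: bool = True
    keep_checkpoints: bool = False

cfg = TrainingConfig(input_file="corpus.txt", output_dir="out/")
```

Only input_file and output_dir must be supplied; everything else falls back to the defaults in the table.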
Architecture Constraints
The hidden_size must be divisible by num_heads. Valid combinations include:
- hidden_size=256, num_heads=4 (64 per head)
- hidden_size=256, num_heads=8 (32 per head)
- hidden_size=512, num_heads=8 (64 per head)
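The constraint is easy to check up front. A small helper (written for this guide, not part of the library) that returns the per-head dimension or rejects an invalid split:

```python
def validate_attention_config(hidden_size: int, num_heads: int) -> int:
    """Return the per-head dimension, raising if hidden_size does not
    split evenly across attention heads."""
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by num_heads ({num_heads})"
        )
    return hidden_size // num_heads

print(validate_attention_config(256, 4))  # 64 dimensions per head
```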
Learning Rate Scheduling
The training pipeline uses linear learning rate scheduling with warmup:

- Starts at 0
- Linearly increases to learning_rate over warmup_ratio * total_steps
- Linearly decreases to 0 over the remaining steps
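The schedule above can be sketched as a pure function of the step index. This is an illustration of the shape only; in practice HuggingFace's get_linear_schedule_with_warmup computes the same curve.

```python
def linear_warmup_lr(step: int, total_steps: int,
                     peak_lr: float = 5e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given step: linear warmup to peak_lr, then
    linear decay to 0 (sketch of the documented schedule)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)              # warmup phase
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / max(1, remaining))  # decay phase
```

With total_steps=100 and the defaults, the rate climbs from 0 to 5e-5 over the first 10 steps, then decays back to 0 over the remaining 90.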
Resume Training from Checkpoint
Training can be resumed from a checkpoint if interrupted:

Training Metrics

When save_metrics=True (default), training metrics are saved to training_metrics.json:

- step: Global training step
- epoch: Current epoch (fractional)
- loss: Training loss
- learning_rate: Current learning rate
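The metrics file is plain JSON, so monitoring it needs nothing beyond the standard library. The records below are invented for illustration; only the field names come from the list above, and the file is assumed to contain a JSON array of such records.

```python
import json

# Illustrative records in the documented shape (values invented for the example).
sample = """[
  {"step": 50,  "epoch": 0.5, "loss": 6.91, "learning_rate": 2.5e-05},
  {"step": 100, "epoch": 1.0, "loss": 5.48, "learning_rate": 5e-05},
  {"step": 150, "epoch": 1.5, "loss": 4.92, "learning_rate": 4.4e-05}
]"""

records = json.loads(sample)           # for a real run: json.load(open("training_metrics.json"))
best = min(records, key=lambda r: r["loss"])
print(f"lowest loss {best['loss']} at step {best['step']} (epoch {best['epoch']})")
```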
Low-Level API
For more control, use ModelTrainer directly:
ONNX Export
Models are automatically exported to ONNX format with INT8 quantization. The exported model can be loaded by SemanticChecker for context-aware validation.

Output Files:

- model.onnx - Quantized model (default)
- model.base.onnx - Original FP32 model
- tokenizer.json - Copied for convenience
Using Trained Models
With SemanticChecker
Standalone Inference
CLI Usage
Train a model via CLI:

Corpus Format

The training corpus should be a text file with one sentence per line:

- UTF-8 encoding
- One sentence per line
- Minimum 100 lines (recommended: 10,000+ lines)
- Segmented text (spaces between words) works best
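A quick sanity check against those requirements can be done before training. This helper is written for this guide (not part of the library) and takes any iterable of lines, e.g. an open file handle:

```python
from typing import Iterable

def check_corpus(lines: Iterable[str], min_lines: int = 100) -> int:
    """Count non-empty sentences and flag corpora below the documented
    minimum of 100 lines (10,000+ recommended)."""
    count = sum(1 for ln in lines if ln.strip())
    if count < min_lines:
        print(f"warning: only {count} sentences; at least {min_lines} "
              f"required, 10,000+ recommended")
    return count

# Usage with a real corpus file:
# with open("corpus.txt", encoding="utf-8") as fh:
#     check_corpus(fh)
```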
GPU Support
Training automatically uses GPU if available:

Batch Size by GPU VRAM
| GPU VRAM | Recommended batch_size |
|---|---|
| 4GB | 8 |
| 8GB | 16 |
| 16GB | 32 |
| 24GB+ | 64 |
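The table above maps directly to a small lookup helper (illustrative only, not a library function):

```python
def suggested_batch_size(vram_gb: float) -> int:
    """Pick a batch_size from the VRAM table above."""
    for threshold, batch in ((24, 64), (16, 32), (8, 16)):
        if vram_gb >= threshold:
            return batch
    return 8  # 4GB-class GPUs
```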
Model Size vs Quality
| Configuration | Parameters | Quality | Speed |
|---|---|---|---|
| Small (default) | ~5M | Good | Fast |
| Medium | ~20M | Better | Medium |
| Large | ~100M | Best | Slow |
Best Practices
- Corpus Size: Use at least 10,000 sentences for meaningful results
- Batch Size: Larger batches (16-32) generally train faster on GPU
- Hidden Size: Start with 256 for small models, 512 for larger ones
- Epochs: 5-10 epochs is usually sufficient; monitor loss for overfitting
- Warmup: 10% warmup (0.1) helps training stability
- Checkpoints: Enable keep_checkpoints=True for long training runs
- Metrics: Always save metrics to monitor training progress
Troubleshooting
Memory Issues
Slow Training
Invalid hidden_size/num_heads
Error Detection Training
Overview
The error detection pipeline trains a token classification model that detects errors in a single forward pass (~10ms). Unlike the MLM pipeline, which trains from scratch, this pipeline fine-tunes a pre-trained XLM-RoBERTa model. Training uses only clean data — the library generates synthetic errors from its existing YAML rules (homophones, typos, phonetic data).

Quick Start
CLI Usage
ErrorDetectionConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_file | str | Required | Path to clean corpus (one sentence per line) |
| output_dir | str | Required | Directory to save model artifacts |
| base_model | str | xlm-roberta-base | Pre-trained model to fine-tune |
| epochs | int | 3 | Number of training epochs |
| batch_size | int | 16 | Batch size per device |
| learning_rate | float | 2e-5 | Peak learning rate |
| corruption_ratio | float | 0.15 | Fraction of words to corrupt per sentence |
| max_length | int | 256 | Maximum sequence length |
| keep_checkpoints | bool | False | Keep intermediate checkpoints |
| save_metrics | bool | True | Save training metrics to JSON |
| seed | int | None | Random seed for reproducibility |
| skip_preprocessing | bool | False | Skip corpus preprocessing |
Corpus Preprocessing
By default, the pipeline preprocesses the corpus before generating synthetic errors:

- Zawgyi detection and conversion — Converts Zawgyi-encoded text to Unicode
- Unicode normalization — NFC normalization, diacritic reordering, zero-width removal
- Quality filtering — Removes non-Myanmar text, too-short/too-long lines
Use --skip-preprocessing only if your corpus is already clean Unicode.
Synthetic Error Generation
The SyntheticErrorGenerator creates training data by corrupting clean sentences:
| Corruption Type | Weight | Source |
|---|---|---|
| Homophone swap | 30% | rules/homophones.yaml |
| Medial confusion (ျ↔ြ, ွ↔ှ) | 20% | Myanmar character sets |
| Similar char swap | 15% | phonetic_data.py VISUAL_SIMILAR |
| Character deletion | 15% | Asat, tone marks, vowels |
| Character insertion | 10% | Double stacking, extra tone |
| Typo pattern | 10% | rules/typo_corrections.yaml (inverted) |
Corrupted words are labeled ERROR (1); unchanged words are labeled CORRECT (0).
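The weighted selection step can be sketched with random.choices. The weights are taken from the table above; the function is an illustration of the selection only, and the actual corruption logic lives in generator.py.

```python
import random

# Corruption types and weights from the table above.
CORRUPTION_WEIGHTS = {
    "homophone_swap": 0.30,
    "medial_confusion": 0.20,
    "similar_char_swap": 0.15,
    "char_deletion": 0.15,
    "char_insertion": 0.10,
    "typo_pattern": 0.10,
}

def pick_corruption(rng: random.Random) -> str:
    """Sample one corruption type for a word according to the
    documented weights (sketch of the selection step only)."""
    types = list(CORRUPTION_WEIGHTS)
    weights = list(CORRUPTION_WEIGHTS.values())
    return rng.choices(types, weights=weights, k=1)[0]
```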
Pipeline Steps
The full pipeline runs 6 steps:

1. Load Corpus — Read input file
2. Preprocess — Zawgyi conversion, normalization, filtering (skippable)
3. Generate Synthetic Errors — Corrupt words using YAML rules
4. Build Dataset — Tokenize with subword alignment, create labels
5. Train Token Classifier — Fine-tune XLM-RoBERTa
6. Export to ONNX — Quantized model for inference
Module Files
The error detection training pipeline is implemented in src/myspellchecker/training/error_detection/:
| File | Description |
|---|---|
| constants.py | Labels (CORRECT/ERROR), corruption weights, default hyperparameters |
| generator.py | SyntheticErrorGenerator — rule-based corruption using YAML rules |
| dataset.py | ErrorDetectionDataset — torch Dataset with subword tokenization and label alignment |
| iterable_dataset.py | ErrorDetectionIterableDataset — streaming IterableDataset that reads pre-generated JSONL examples line-by-line, avoiding loading the entire corpus into RAM |
| alignment.py | Token-label alignment utilities — maps word-level labels to subword tokens using binary search (O(T log W)) |
| trainer.py | ErrorDetectionTrainer — fine-tunes XLM-RoBERTa for token classification |
| pipeline.py | ErrorDetectionPipeline + ErrorDetectionConfig — end-to-end orchestrator |
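The binary-search alignment strategy that alignment.py uses can be sketched as follows. Each subword token receives the label of the word whose character span contains the token's start offset; special tokens (whose offset span is empty) get -100 so the loss function ignores them. The function signature here is an assumption for illustration, not the library's actual API.

```python
from bisect import bisect_right

def align_labels(word_starts, word_labels, token_offsets):
    """Map word-level labels to subword tokens via binary search over
    word start positions (the O(T log W) strategy described above).
    token_offsets are (start, end) character spans from the tokenizer."""
    labels = []
    for start, end in token_offsets:
        if start == end:                          # [CLS]/[SEP]/padding
            labels.append(-100)
            continue
        word_idx = bisect_right(word_starts, start) - 1
        labels.append(word_labels[word_idx])
    return labels
```

For a two-word sentence where the second word is corrupted (labels [0, 1]) and the tokenizer emits a special token, one subword per word, then another special token, the output is [-100, 0, 1, -100].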
Using the Trained Model
Comparing Training Pipelines
| Aspect | train-model (MLM) | train-detector (Token Classifier) |
|---|---|---|
| Purpose | Semantic validation | Error detection |
| Base model | Train from scratch | Fine-tune XLM-RoBERTa |
| Training data | Raw corpus | Clean corpus + synthetic errors |
| Preprocessing | None (user responsibility) | Built-in (Zawgyi, normalization) |
| Inference | N forward passes per sentence | Single forward pass |
| Speed | ~200ms per sentence | ~10ms per sentence |
| Output | Suggestions + error detection | Error detection only |
| CLI | train-model | train-detector |
See Also
- Error Detection - Error detection feature overview
- Semantic Checking - Using trained models for context validation
- CLI Reference - train-model and train-detector command details
- Configuration Guide - SemanticConfig options