SemanticChecker. While the standard N-gram checker works well for local context, these Transformer-based models capture long-range dependencies and semantic meaning.
Overview
The pipeline automates the entire process:
- Tokenizer Training: Creating a vocabulary from your specific corpus.
- Model Training: Pre-training a transformer model (Masked Language Modeling).
- Export: Converting the model to ONNX format for fast, dependency-light inference.
Usage
CLI Usage
Python API Usage
Pipeline Stages
1. Tokenizer Training
- Goal: Create a subword tokenizer optimized for Myanmar text.
- Algorithm: Byte-Level BPE (Byte-Pair Encoding).
- Output:
tokenizer.json
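Tokenizer training of this kind can be sketched with the Hugging Face `tokenizers` library (the corpus path, vocabulary size, and special tokens below are illustrative assumptions, not the pipeline's actual defaults):

```python
from tokenizers import ByteLevelBPETokenizer

# Tiny illustrative corpus; in practice, point `files` at your real Myanmar corpus.
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("မြန်မာစာ is the Burmese script used to write Myanmar text.\n" * 100)

# Byte-Level BPE: starts from raw bytes, so no characters are ever out-of-vocabulary.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=1000,          # assumed value for this sketch
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Serialize the full tokenizer (vocab + merges + config) to a single file.
tokenizer.save("tokenizer.json")
```

Saving to a single `tokenizer.json` keeps the vocabulary, merge rules, and pre-tokenization settings together, so inference code can reload the tokenizer without the training corpus.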
2. Language Model Training
- Goal: Learn the probability distribution of words in context.
- Architecture: RoBERTa or BERT (encoder-only transformer, selected via the ModelArchitecture enum).
- Task: Masked Language Modeling (MLM). Random tokens are masked, and the model attempts to predict them.
- Hyperparameters:
hidden_size: Dimension of the embeddings (default: 256).
num_layers: Number of transformer blocks (default: 4).
num_heads: Attention heads (default: 4).
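Instantiating an encoder with these hyperparameters can be sketched with the Hugging Face `transformers` library (the `vocab_size` and `intermediate_size` values are assumptions for the sketch; only `hidden_size`, layer count, and head count come from the defaults above):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Map the pipeline's hyperparameters onto a RoBERTa config:
#   hidden_size -> hidden_size, num_layers -> num_hidden_layers, num_heads -> num_attention_heads
config = RobertaConfig(
    vocab_size=1000,              # assumed; must match the trained tokenizer
    hidden_size=256,              # embedding dimension (pipeline default)
    num_hidden_layers=4,          # transformer blocks (pipeline default)
    num_attention_heads=4,        # attention heads (pipeline default)
    intermediate_size=1024,       # assumed; commonly 4 * hidden_size
)

# RobertaForMaskedLM adds the MLM prediction head used during pre-training.
model = RobertaForMaskedLM(config)
```

Note that `hidden_size` must be divisible by `num_attention_heads` (here 256 / 4 = 64 dimensions per head), which the defaults above satisfy.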
3. ONNX Export & Quantization
- Goal: Optimize the model for production use.
- Process:
- Converts the PyTorch dynamic graph to a static ONNX graph.
- Quantization: Converts 32-bit floating point weights to 8-bit unsigned integers (QUInt8). This reduces model size by 4x and speeds up CPU inference significantly with minimal accuracy loss.
- Output:
model.onnx
Hardware Requirements
- Training: A GPU (NVIDIA CUDA or Mac MPS) is highly recommended but not strictly required. The pipeline automatically detects available accelerators.
- Inference: The resulting ONNX models are designed to run efficiently on standard CPUs.