The myspellchecker CLI is installed with the package and provides commands for checking text, building dictionaries, training AI models, segmenting text, and managing configuration.

Installation

The CLI is installed automatically with the package:
pip install myspellchecker
myspellchecker --help

Commands Overview

| Command | Description |
| --- | --- |
| check | Check text for spelling errors (default when no command given) |
| build | Build dictionary database from corpus |
| train-model | Train a custom semantic model |
| train-detector | Train an error detection model (token classification) |
| segment | Segment text into words |
| config | Manage configuration |
| infer-pos | Infer POS tags for database |
| completion | Generate shell completion |

check

Check text for spelling errors. This is the default command: if no subcommand is recognized, check is assumed.

Usage

myspellchecker check [OPTIONS] [INPUT]

Arguments

| Argument | Description |
| --- | --- |
| INPUT | Input file path (or stdin if omitted) |

Options

| Option | Short | Description |
| --- | --- | --- |
| --output | -o | Output file path (default: stdout) |
| --format | -f | Output format: json, text, csv, rich (default: rich for TTY, json for pipes) |
| --color | | Force color output even when not a TTY |
| --no-color | | Disable color output |
| --level | | Validation level: syllable, word (default: syllable) |
| --db | | Custom database path |
| --no-phonetic | | Disable phonetic matching |
| --no-context | | Disable context checking |
| --preset | -p | Configuration preset: default, fast, accurate, minimal, strict |
| --verbose | -v | Enable verbose logging |
| --config | -c | Path to configuration file (YAML or JSON format) |

Note: --color and --no-color are mutually exclusive.

Examples

# Check a file
myspellchecker check document.txt

# Check with JSON output
myspellchecker check document.txt -f json -o results.json

# Check from stdin
echo "မြန်မာနိုင်ငံ" | myspellchecker check

# Use specific database
myspellchecker check document.txt --db custom.db

# Fast checking (syllable only)
myspellchecker check document.txt --level syllable

# Thorough checking
myspellchecker check document.txt --level word -p accurate

# With custom config file
myspellchecker check document.txt -c config.yaml

# Force color output in a pipe
myspellchecker check document.txt --color | less -R

# Rich formatted output (default in terminal)
myspellchecker check document.txt -f rich

Output Formats

Rich (default in terminal): Colored, formatted output with panels and tables using the Rich library. Auto-selected when running in an interactive terminal.

Text (grep-like):
# WARNING: Myanmar text may not render correctly in your terminal.
# Use a text editor with proper font support to view this output.

document.txt:1:5: invalid_syllable 'xyz' -> Try: [abc, def, ghi]

# Summary: 1 errors found in 10 lines.

JSON (default in pipes):
{
  "summary": {
    "total_errors": 1,
    "total_lines": 10
  },
  "results": [
    {
      "file": "document.txt",
      "line": 1,
      "text": "...",
      "has_errors": true,
      "errors": [
        {
          "text": "xyz",
          "position": 5,
          "error_type": "invalid_syllable",
          "suggestions": ["abc", "def", "ghi"],
          "confidence": 1.0
        }
      ]
    }
  ]
}

CSV:
file,line,position,error_type,text,suggestions
document.txt,1,5,invalid_syllable,xyz,"abc,def,ghi"
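Downstream tools can consume the JSON format directly. A minimal sketch that parses the documented structure and prints grep-style lines (the `report` string stands in for real checker output piped from `myspellchecker check -f json`):

```python
import json

# Sample output in the documented JSON format (illustrative values).
report = """
{
  "summary": {"total_errors": 1, "total_lines": 10},
  "results": [
    {
      "file": "document.txt",
      "line": 1,
      "text": "...",
      "has_errors": true,
      "errors": [
        {
          "text": "xyz",
          "position": 5,
          "error_type": "invalid_syllable",
          "suggestions": ["abc", "def", "ghi"],
          "confidence": 1.0
        }
      ]
    }
  ]
}
"""

data = json.loads(report)
print(f"{data['summary']['total_errors']} error(s) in {data['summary']['total_lines']} line(s)")
for result in data["results"]:
    for err in result["errors"]:
        # Emit a grep-style line: file:line:position: type 'text' -> suggestions
        print(f"{result['file']}:{result['line']}:{err['position']}: "
              f"{err['error_type']} {err['text']!r} -> {', '.join(err['suggestions'])}")
```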

build

Build a dictionary database from corpus files.

Usage

myspellchecker build [OPTIONS]

Options

| Option | Short | Description |
| --- | --- | --- |
| --input | -i | Input corpus file(s) (UTF-8 encoded text, CSV, TSV, JSON) |
| --output | -o | Output database path (default: mySpellChecker-default.db) |
| --work-dir | | Directory for intermediate files (default: temp_build) |
| --keep-intermediate | | Keep intermediate files after build |
| --sample | | Generate sample corpus for testing |
| --col | | Column name/index for CSV/TSV files (default: text) |
| --json-key | | Key name for JSON objects (default: text) |
| --pos-tagger | | POS tagger type: rule_based, viterbi, transformer |
| --pos-model | | HuggingFace model ID or local path for transformer tagger |
| --pos-device | | Device for transformer POS tagger: -1=CPU, 0+=GPU (default: -1) |
| --incremental | | Perform incremental update on existing database |
| --curated-input | | Path to curated lexicon CSV file (words marked as is_curated=1) |
| --word-engine | | Word segmentation engine: myword, crf, transformer (default: myword) |
| --seg-model | | HuggingFace model ID or local path for transformer word segmentation (only used when --word-engine=transformer) |
| --seg-device | | Device for transformer word segmenter: -1=CPU, 0+=GPU (default: -1; only used when --word-engine=transformer) |
| --validate | | Validate inputs only without building (pre-flight check) |
| --min-frequency | | Minimum word frequency threshold (default: from config) |
| --num-workers | | Number of parallel workers (default: auto-detect based on CPU cores) |
| --batch-size | | Batch size for processing (default: 10000) |
| --worker-timeout | | Worker timeout in seconds for parallel processing (default: 300) |
| --no-dedup | | Disable line-level deduplication during ingestion |
| --no-desegment | | Keep word segmentation markers in output |
| --verbose | -v | Enable verbose logging with detailed timing breakdowns |

Examples

# Build sample database
myspellchecker build --sample

# Build from corpus file
myspellchecker build -i corpus.txt -o dictionary.db

# Build from multiple files
myspellchecker build -i "data/*.txt" "extra/*.json"

# Build from directory (auto-detects txt, json, jsonl)
myspellchecker build -i ./corpus/ -o dictionary.db

# Validate before building
myspellchecker build -i corpus.txt --validate

# With POS tagging
myspellchecker build -i corpus.txt --pos-tagger viterbi

# Incremental update
myspellchecker build -i new_data.txt -o dictionary.db --incremental

# Filter by frequency
myspellchecker build -i corpus.txt -o dictionary.db --min-frequency 5

# Build with curated lexicon
myspellchecker build -i corpus.txt --curated-input data/curated_lexicon.csv -o dictionary.db

# Combine corpus with curated lexicon and transformer POS tagger
myspellchecker build -i corpus.txt --curated-input data/curated_lexicon.csv \
  --pos-tagger transformer --pos-device 0 -o dictionary.db

Build Process

  1. Ingestion: Read and parse input files
  2. Segmentation: Break text into syllables and words
  3. Frequency Calculation: Count occurrences and N-grams
  4. POS Tagging: Tag words with part-of-speech (if enabled)
  5. Packaging: Create optimized SQLite database
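Steps 2-3 above amount to counting segmented units and their N-grams. A toy sketch of the frequency pass, assuming pre-segmented whitespace-separated tokens stand in for real syllable/word segmentation (the actual build uses the configured segmentation engine):

```python
from collections import Counter

def count_frequencies(lines, n=2):
    """Count token frequencies and token N-grams over a segmented corpus."""
    tokens_count, ngrams_count = Counter(), Counter()
    for line in lines:
        tokens = line.split()  # placeholder for real segmentation
        tokens_count.update(tokens)
        # Sliding window of size n over each line's tokens
        ngrams_count.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return tokens_count, ngrams_count

corpus = ["a b a", "b a b"]
words, bigrams = count_frequencies(corpus)
# words["a"] == 3, bigrams[("a", "b")] == 2
```

Counts like these are what a --min-frequency threshold would later filter before packaging.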

train-model

Train semantic models for context checking.

Usage

myspellchecker train-model [OPTIONS]

Options

| Option | Short | Description |
| --- | --- | --- |
| --input | -i | Input corpus file (required; raw text, one sentence per line) |
| --output | -o | Output directory for the model (required) |
| --architecture | -a | Model architecture: roberta, bert (default: roberta) |
| --epochs | | Number of training epochs (default: 5) |
| --batch-size | | Training batch size (default: 16) |
| --learning-rate | | Peak learning rate (default: 5e-5) |
| --warmup-ratio | | Ratio of steps for LR warmup (default: 0.1) |
| --weight-decay | | Weight decay for optimizer (default: 0.01) |
| --hidden-size | | Size of hidden layers (default: 256) |
| --layers | | Number of transformer layers (default: 4) |
| --heads | | Number of attention heads (default: 4) |
| --max-length | | Maximum sequence length (default: 128) |
| --vocab-size | | Tokenizer vocabulary size (default: 15000) |
| --min-frequency | | Minimum token frequency (default: 2) |
| --resume | | Resume training from checkpoint directory |
| --keep-checkpoints | | Keep intermediate PyTorch checkpoints after export |
| --no-metrics | | Disable saving training metrics to JSON |

Architectures

| Architecture | Description |
| --- | --- |
| roberta | RoBERTa (default): dynamic masking, no NSP |
| bert | BERT: static masking, with NSP capability |

Examples

# Train with default settings (RoBERTa architecture)
myspellchecker train-model -i corpus.txt -o ./models/

# Train BERT model with more epochs
myspellchecker train-model -i corpus.txt -o ./models/ --architecture bert --epochs 10

# Train with custom hyperparameters
myspellchecker train-model -i corpus.txt -o ./models/ \
    --learning-rate 3e-5 --warmup-ratio 0.1 --weight-decay 0.01

# Train larger model
myspellchecker train-model -i corpus.txt -o ./models/ \
    --hidden-size 512 --layers 6 --heads 8

# Resume training from checkpoint
myspellchecker train-model -i corpus.txt -o ./models/ \
    --resume ./models/checkpoints/checkpoint-500

# Keep checkpoints and disable metrics
myspellchecker train-model -i corpus.txt -o ./models/ \
    --keep-checkpoints --no-metrics

train-detector

Train an error detection model using token classification (fine-tunes XLM-RoBERTa).

Usage

myspellchecker train-detector [OPTIONS]

Options

| Option | Short | Description |
| --- | --- | --- |
| --input | -i | Input corpus file (required; clean text, one sentence per line) |
| --output | -o | Output directory for the model (required) |
| --base-model | | Base model for fine-tuning (default: xlm-roberta-base) |
| --epochs | | Number of training epochs (default: 3) |
| --batch-size | | Training batch size (default: 16) |
| --learning-rate | | Peak learning rate (default: 2e-5) |
| --corruption-ratio | | Fraction of words to corrupt per sentence (default: 0.15) |
| --max-length | | Maximum sequence length (default: 256) |
| --keep-checkpoints | | Keep intermediate PyTorch checkpoints after ONNX export |
| --no-metrics | | Disable saving training metrics to JSON |
| --seed | | Random seed for reproducibility |
| --skip-preprocessing | | Skip corpus preprocessing (Zawgyi conversion, normalization) |

Examples

# Train with default settings
myspellchecker train-detector -i corpus.txt -o ./detector/

# Train with more epochs and lower corruption
myspellchecker train-detector -i corpus.txt -o ./detector/ \
    --epochs 5 --corruption-ratio 0.10

# Train with custom hyperparameters
myspellchecker train-detector -i corpus.txt -o ./detector/ \
    --learning-rate 3e-5 --batch-size 32

# Skip preprocessing for already-clean corpus
myspellchecker train-detector -i corpus.txt -o ./detector/ \
    --skip-preprocessing

# Keep checkpoints for debugging
myspellchecker train-detector -i corpus.txt -o ./detector/ \
    --keep-checkpoints

Training Process

  1. Load Corpus — Read input file (one sentence per line)
  2. Preprocess — Zawgyi conversion, Unicode normalization, quality filtering
  3. Generate Errors — Create synthetic errors from YAML rules
  4. Build Dataset — Tokenize with subword label alignment
  5. Train Model — Fine-tune XLM-RoBERTa for token classification
  6. Export to ONNX — Quantize and export for inference
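Step 3 can be pictured as follows. This is an illustrative sketch only, not the actual YAML-rule-driven generator: the word-reversal "corruption" is a made-up stand-in for the real error rules, and it shows how a corruption ratio and seed yield labelled training pairs:

```python
import random

def corrupt_sentence(words, ratio=0.15, seed=42):
    """Corrupt a fraction of words and emit per-token labels (1 = error)."""
    rng = random.Random(seed)
    n_corrupt = max(1, round(ratio * len(words)))
    targets = set(rng.sample(range(len(words)), n_corrupt))
    out, labels = [], []
    for i, word in enumerate(words):
        if i in targets:
            out.append(word[::-1])  # toy corruption: reverse the word
            labels.append(1)
        else:
            out.append(word)
            labels.append(0)
    return out, labels

words = ["this", "is", "a", "clean", "sentence", "indeed"]
corrupted, labels = corrupt_sentence(words)
# With the default ratio, round(0.15 * 6) == 1 token is corrupted.
```

The real pipeline then aligns these word-level labels to subword tokens before fine-tuning.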

segment

Segment text into words and optionally tag with POS.

Usage

myspellchecker segment [OPTIONS] [INPUT]

Options

| Option | Short | Description |
| --- | --- | --- |
| --output | -o | Output file path (default: stdout) |
| --format | -f | Output format: text, json, tsv (default: text) |
| --tag | | Include POS tags (uses joint segmentation-tagging) |
| --db | | Custom database path |
| --verbose | -v | Enable verbose logging |

Examples

# Segment text (default text format)
myspellchecker segment document.txt

# Output as JSON
myspellchecker segment document.txt -f json

# Output as TSV
myspellchecker segment document.txt -f tsv

# With POS tags
myspellchecker segment document.txt --tag

# From stdin
echo "မြန်မာနိုင်ငံ" | myspellchecker segment

Output

Word mode with tags:
မြန်မာ/N နိုင်ငံ/N
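The word/TAG output is easy to post-process. A small sketch that splits each token on its final slash (using `rpartition` so a slash inside a word would not break the parse):

```python
def parse_tagged(line: str):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged("မြန်မာ/N နိုင်ငံ/N")
# → [("မြန်မာ", "N"), ("နိုင်ငံ", "N")]
```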

config

Manage configuration files.

Usage

myspellchecker config [SUBCOMMAND]

Subcommands

| Subcommand | Description |
| --- | --- |
| init | Create a new configuration file with defaults |
| show | Show configuration file search paths and current config |

config init Options

| Option | Description |
| --- | --- |
| --path | Path for configuration file (default: ~/.config/myspellchecker/myspellchecker.yaml) |
| --force | Overwrite existing configuration file |

Examples

# Create config file (default location)
myspellchecker config init

# Create config file at custom path
myspellchecker config init --path ./myspellchecker.yaml

# Overwrite existing config file
myspellchecker config init --force

# Show current config and search paths
myspellchecker config show

Configuration File Locations

Configuration files are searched in this order:
  1. Path specified with --config flag
  2. Current directory: myspellchecker.yaml, myspellchecker.yml, or myspellchecker.json
  3. User config directory: ~/.config/myspellchecker/myspellchecker.{yaml,yml,json}
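The search order above can be sketched as follows. This is a simplified illustration of the documented order, not the library's actual resolver:

```python
import tempfile
from pathlib import Path
from typing import Optional

CONFIG_NAMES = ("myspellchecker.yaml", "myspellchecker.yml", "myspellchecker.json")

def find_config(explicit: Optional[str] = None) -> Optional[Path]:
    """Return the first existing config file in the documented search order."""
    candidates = []
    if explicit:
        candidates.append(Path(explicit))                   # 1. --config flag
    candidates += [Path.cwd() / n for n in CONFIG_NAMES]    # 2. current directory
    user_dir = Path.home() / ".config" / "myspellchecker"
    candidates += [user_dir / n for n in CONFIG_NAMES]      # 3. user config dir
    return next((c for c in candidates if c.is_file()), None)

# An explicit path wins when the file exists:
with tempfile.NamedTemporaryFile(suffix=".yaml") as tmp:
    explicit_hit = find_config(tmp.name)
```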

infer-pos

Infer POS tags for untagged words in the database using a rule-based engine.

Usage

myspellchecker infer-pos [OPTIONS]

Options

| Option | Short | Description |
| --- | --- | --- |
| --db | | Database path to update with inferred POS tags (required) |
| --min-frequency | | Minimum word frequency for inference (default: 0, infer all) |
| --min-confidence | | Minimum confidence threshold, 0.0-1.0 (default: 0.0) |
| --include-tagged | | Also infer for words that already have pos_tag (updates inferred_pos only) |
| --dry-run | | Show statistics without modifying the database |
| --verbose | -v | Enable verbose output with detailed statistics |

Inference Sources

| Source | Description |
| --- | --- |
| numeral_detection | Myanmar numerals and numeral words |
| prefix_pattern | Words with prefix patterns (e.g., အ prefix -> Noun) |
| proper_noun_suffix | Proper noun suffixes (country, city names) |
| ambiguous_registry | Known ambiguous words (multi-POS) |
| morphological | Suffix-based morphological analysis |

Examples

# Infer POS tags for all untagged words
myspellchecker infer-pos --db dictionary.db

# Infer only for high-frequency words
myspellchecker infer-pos --db dictionary.db --min-frequency 10

# Set minimum confidence threshold
myspellchecker infer-pos --db dictionary.db --min-confidence 0.7

# Preview changes without modifying database
myspellchecker infer-pos --db dictionary.db --dry-run

# Include already-tagged words for re-inference
myspellchecker infer-pos --db dictionary.db --include-tagged

completion

Generate shell completion scripts.

Usage

myspellchecker completion --shell [bash|zsh|fish]

Options

| Option | Description |
| --- | --- |
| --shell | Shell type: bash, zsh, fish (default: bash) |

Examples

# Generate bash completion
myspellchecker completion --shell bash > ~/.bash_completion.d/myspellchecker
source ~/.bash_completion.d/myspellchecker

# Generate zsh completion
myspellchecker completion --shell zsh > ~/.zsh/completions/_myspellchecker

# Generate fish completion
myspellchecker completion --shell fish > ~/.config/fish/completions/myspellchecker.fish

Global Options

Available for all commands:

| Option | Description |
| --- | --- |
| --help | Show help message |

Note: --verbose/-v is available on most subcommands (check, build, segment, infer-pos) but is defined per subcommand, not globally.

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Success (no errors found, or validation passed) |
| 1 | General runtime error (configuration, data loading, etc.) |
| 2 | Invalid arguments, file not found, or permission error |
| 130 | Process interrupted (Ctrl+C) |
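In scripts, these codes can drive control flow. A hedged sketch of a wrapper that maps return codes to the meanings above (the subprocess call is shown only in a comment, since it requires the CLI to be installed):

```python
EXIT_MEANINGS = {
    0: "success",
    1: "runtime error",
    2: "usage error (bad arguments, missing file, permissions)",
    130: "interrupted",
}

def describe_exit(code: int) -> str:
    """Translate a myspellchecker exit code into a human-readable label."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")

# Typical use:
# import subprocess
# proc = subprocess.run(["myspellchecker", "check", "document.txt"])
# print(describe_exit(proc.returncode))
```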

Configuration File

Create ~/.config/myspellchecker/myspellchecker.yaml:
# Database path (required - no bundled database included)
database: /path/to/your/custom.db

# Use a preset (default, fast, accurate, minimal, strict)
preset: default

# Core settings
max_edit_distance: 2
max_suggestions: 5

# Feature toggles
use_phonetic: true
use_context_checker: true

# Provider configuration
provider_config:
  pool_max_size: 5
  pool_timeout: 5.0
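Since JSON configs are also supported (see Configuration File Locations), the same settings can be generated programmatically. A sketch that writes a JSON equivalent of the YAML above; the paths are placeholders:

```python
import json
from pathlib import Path

# Same settings as the YAML example, expressed as JSON.
config = {
    "database": "/path/to/your/custom.db",  # required; no bundled database
    "preset": "default",
    "max_edit_distance": 2,
    "max_suggestions": 5,
    "use_phonetic": True,
    "use_context_checker": True,
    "provider_config": {"pool_max_size": 5, "pool_timeout": 5.0},
}

path = Path("myspellchecker.json")
path.write_text(json.dumps(config, indent=2), encoding="utf-8")
```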

Environment Variables

| Variable | Description |
| --- | --- |
| MYSPELL_DATABASE_PATH | Default database path |
| MYSPELL_MAX_EDIT_DISTANCE | Max edit distance (1-3) |
| MYSPELL_USE_CONTEXT_CHECKER | Enable context validation (true/false) |
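These variables can also be set from Python before invoking the checker, e.g. in a wrapper or test harness (values here are placeholders):

```python
import os

os.environ["MYSPELL_DATABASE_PATH"] = "/path/to/custom.db"
os.environ["MYSPELL_MAX_EDIT_DISTANCE"] = "2"      # accepted range: 1-3
os.environ["MYSPELL_USE_CONTEXT_CHECKER"] = "true"
```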

Next Steps