This page covers everything you need to build a dictionary — from a simple CLI command to advanced Python API usage with POS tagging, curated lexicons, and incremental updates.
CLI Reference
Basic Build
# Build from text corpus
myspellchecker build --input corpus.txt --output dict.db
# Build sample database for testing
myspellchecker build --sample
# Build with POS tagging
myspellchecker build --input corpus.txt --output dict.db --pos-tagger transformer
Build Options
| Option | Default | Description |
|---|---|---|
| --input FILE | Required | Input corpus file (TXT, CSV, TSV, JSON, JSONL, Parquet) |
| --output FILE | mySpellChecker-default.db | Output SQLite database path |
| --sample | — | Build a small sample database (no input needed) |
| --incremental | — | Update existing database instead of rebuilding |
| --min-frequency N | 50 | Minimum word frequency to include |
| --pos-tagger TYPE | rule_based | POS tagger: rule_based, viterbi, or transformer |
| --pos-model NAME | — | HuggingFace model for transformer tagger |
| --pos-device ID | — | GPU device ID for transformer tagger |
| --num-workers N | CPU count | Parallel worker processes |
| --batch-size N | 10000 | Records per processing batch |
| --curated-input FILE | — | CSV file with trusted vocabulary words |
| --word-engine TYPE | myword | Word segmentation engine: myword or crf |
| --validate | — | Pre-flight validation of input (no build) |
Run myspellchecker build --help for additional flags including --work-dir, --keep-intermediate, --col, --json-key, --worker-timeout, and --verbose.
Python API
Basic Usage
from myspellchecker.data_pipeline import Pipeline
pipeline = Pipeline()
pipeline.build_database(
input_files=["corpus.txt"],
database_path="dict.db",
)
With Configuration
from myspellchecker.data_pipeline import Pipeline, PipelineConfig
config = PipelineConfig(
batch_size=10000, # Records per batch
num_shards=20, # Shards for ingestion
num_workers=4, # Parallel workers (None = auto-detect)
min_frequency=50, # Minimum word frequency to include
word_engine="myword", # Word segmentation engine ("myword", "crf", "transformer")
keep_intermediate=False, # Keep intermediate Arrow files
text_col="text", # Column name for CSV/TSV
json_key="text", # Key name for JSON
)
pipeline = Pipeline(config=config)
pipeline.build_database(
input_files=["corpus.txt"],
database_path="dict.db",
)
Building from Multiple Files
pipeline.build_database(
input_files=[
"general_corpus.txt",
"domain_specific.txt",
"organization_names.txt",
],
database_path="combined.db",
)
POS Tagging
Add POS tags to dictionary entries during the build to enable grammar-checking support:
# CLI: rule-based (fast, no dependencies)
myspellchecker build --input corpus.txt --output dict.db --pos-tagger rule_based
# CLI: transformer (highest accuracy, requires GPU)
myspellchecker build --input corpus.txt --output dict.db \
--pos-tagger transformer \
--pos-model chuuhtetnaing/myanmar-pos-model \
--pos-device 0
# Python API
from myspellchecker.core.config import POSTaggerConfig
config = PipelineConfig(
pos_tagger=POSTaggerConfig(
tagger_type="transformer",
model_name="chuuhtetnaing/myanmar-pos-model",
device=0,
),
)
pipeline = Pipeline(config=config)
pipeline.build_database(["corpus.txt"], "dict.db")
POS Inference on Existing Database
Apply rule-based POS inference to an existing database without rebuilding:
from myspellchecker.data_pipeline import DatabasePackager
packager = DatabasePackager.from_existing("dictionary.db")
stats = packager.apply_inferred_pos(
min_frequency=0,
min_confidence=0.0,
)
packager.close()
print(f"Inferred POS for {stats['inferred']} words")
Curated Lexicons
Curated words are trusted vocabulary inserted directly into the database before corpus processing. They are always recognized as valid regardless of corpus frequency.
Create a Lexicon CSV
word
ဆေးရုံ
ဆရာဝန်
လူနာ
ကုမ္ပဏီ
Build with Curated Words
myspellchecker build --input corpus.txt --output dict.db \
--curated-input curated_lexicon.csv
How Curated Words are Processed
| Scenario | frequency | is_curated |
|---|---|---|
| Curated only (not in corpus) | 0 | 1 |
| Curated + corpus overlap | corpus_freq | 1 |
| Corpus only | corpus_freq | 0 |
Curated words are inserted first (is_curated=1, frequency=0), then corpus processing updates frequency while preserving the is_curated flag via MAX().
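The insert-then-merge behavior can be sketched with plain sqlite3. This is an illustrative reduction of the pipeline's logic, not its actual SQL; the simplified words table here has only the three relevant columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER, is_curated INTEGER)"
)

# Step 1: curated words are inserted first with frequency=0, is_curated=1.
conn.execute("INSERT INTO words VALUES ('ဆေးရုံ', 0, 1)")

# Step 2: corpus processing upserts counts; MAX() preserves the curated flag
# for words that also appear in the corpus.
for word, freq in [("ဆေးရုံ", 120), ("မြန်မာ", 900)]:
    conn.execute(
        """
        INSERT INTO words (word, frequency, is_curated) VALUES (?, ?, 0)
        ON CONFLICT(word) DO UPDATE SET
            frequency = excluded.frequency,
            is_curated = MAX(is_curated, excluded.is_curated)
        """,
        (word, freq),
    )

print(dict(conn.execute("SELECT word, is_curated FROM words")))
# curated+corpus word keeps is_curated=1; corpus-only word gets is_curated=0
```

This matches the table above: a curated word absent from the corpus keeps frequency 0, an overlapping word gets its corpus frequency while staying curated, and a corpus-only word is never flagged as curated.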
Incremental Updates
Add new data to an existing dictionary without rebuilding from scratch:
myspellchecker build --input new_data.txt --output existing.db --incremental
pipeline.build_database(
input_files=["new_data.txt"],
database_path="existing.db",
incremental=True,
)
The pipeline tracks processed files in a processed_files table to avoid reprocessing.
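A minimal sketch of that skip logic, assuming the processed_files(path, mtime, size) columns shown in the schema section. The helper names here are illustrative, not part of the pipeline's actual API:

```python
import os
import sqlite3
import tempfile

def needs_processing(conn, path):
    """Illustrative check: skip a file whose mtime and size are unchanged."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT mtime, size FROM processed_files WHERE path = ?", (path,)
    ).fetchone()
    return row is None or row != (st.st_mtime, st.st_size)

def mark_processed(conn, path):
    st = os.stat(path)
    conn.execute(
        "INSERT OR REPLACE INTO processed_files VALUES (?, ?, ?)",
        (path, st.st_mtime, st.st_size),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_files (path TEXT PRIMARY KEY, mtime REAL, size INTEGER)"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("new corpus data\n")

print(needs_processing(conn, f.name))   # unseen file: True
mark_processed(conn, f.name)
print(needs_processing(conn, f.name))   # unchanged file: False
```

Comparing both mtime and size means a file is reprocessed if it is edited in place, not just if it is renamed or replaced.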
Output Database Schema
-- Core tables
syllables(id, syllable, frequency)
words(id, word, syllable_count, frequency, pos_tag, is_curated,
inferred_pos, inferred_confidence, inferred_source)
bigrams(id, word1_id, word2_id, probability, count)
trigrams(id, word1_id, word2_id, word3_id, probability, count)
-- POS probability tables (for Viterbi tagger)
pos_unigrams(pos, probability)
pos_bigrams(pos1, pos2, probability)
pos_trigrams(pos1, pos2, pos3, probability)
-- File tracking (for incremental builds)
processed_files(path, mtime, size)
Query Examples
import sqlite3

conn = sqlite3.connect("dict.db")
cursor = conn.cursor()

# Look up word frequency
cursor.execute("SELECT frequency FROM words WHERE word = ?", ("မြန်မာ",))
print(cursor.fetchone())

# Get bigram probability
cursor.execute("""
    SELECT b.probability
    FROM bigrams b
    JOIN words w1 ON b.word1_id = w1.id
    JOIN words w2 ON b.word2_id = w2.id
    WHERE w1.word = ? AND w2.word = ?
""", ("ထမင်း", "စား"))
print(cursor.fetchone())
Verification
from myspellchecker.data_pipeline import DatabasePackager
packager = DatabasePackager(input_dir, database_path)
packager.connect()
packager.verify_database()
packager.print_stats()
packager.close()
Build Time
| Corpus Size | Build Time | Peak Memory |
|---|---|---|
| 1M words | ~30s | ~200MB |
| 10M words | ~5min | ~500MB |
| 100M words | ~45min | ~2GB |
For large corpora, see Optimization for DuckDB acceleration (3-50x faster frequency counting) and Cython parallelization.
# Tune for large corpora
myspellchecker build --input huge_corpus.txt \
--num-workers 8 \
--batch-size 500000
See Also