.db) used by mySpellChecker at runtime.
It is designed to handle large datasets (10GB+) by using sharding, intermediate binary formats (Arrow), and resume capabilities.
## Usage

### CLI Usage
The easiest way to use the pipeline is via the command-line interface.
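A typical invocation might look like the following; the entry point, subcommand, and flags are illustrative assumptions, not the tool's documented interface:

```shell
# Hypothetical invocation — the real entry point and flag names may differ.
python -m pipeline build \
    --input ./raw_corpus \
    --output ./build \
    --min-frequency 50
```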
### Python API Usage

You can also invoke the pipeline programmatically.
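A sketch of what programmatic use might look like; only `PipelineConfig` is named in this document, so the module path, the `Pipeline` class, and the `run` signature are assumptions:

```python
# Hypothetical programmatic entry point — names other than PipelineConfig
# are assumptions for illustration.
from myspellchecker.pipeline import Pipeline, PipelineConfig

config = PipelineConfig(min_frequency=50, num_workers=None)
pipeline = Pipeline(config)
pipeline.run(input_dir="raw_corpus/", output_db="spellcheck.db")
```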
## Architecture

The pipeline executes in four distinct steps. It tracks file modification times to skip steps that are already up to date (resume capability).

### Ingestion
- Input: Raw text files (`.txt`, `.csv`, `.tsv`, `.json`, `.jsonl`, `.parquet`).
- Process:
  - Reads files in chunks.
  - Normalizes text (Unicode normalization).
  - Splits into shards for parallel processing.
- Output: `raw_shards/*.arrow` (Apache Arrow files).
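The chunking, normalization, and shard-assignment logic above can be sketched as follows. The chunk size, the NFC normalization form, and the CRC32-based shard assignment are assumptions, and writing the actual Arrow files (e.g. via `pyarrow`) is elided:

```python
import unicodedata
import zlib
from itertools import islice

def iter_chunks(lines, chunk_size=10_000):
    """Yield successive fixed-size chunks, mirroring chunked file reading."""
    it = iter(lines)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

def normalize(text: str) -> str:
    """Unicode-normalize a line (NFC is one plausible choice of form)."""
    return unicodedata.normalize("NFC", text.strip())

def shard_of(line: str, num_shards: int) -> int:
    """Stable hash-based shard assignment for parallel processing."""
    return zlib.crc32(line.encode("utf-8")) % num_shards
```

A stable hash (rather than Python's salted `hash()`) keeps shard assignment reproducible across runs, which matters for the resume capability.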
### Segmentation
- Input: `raw_shards/*.arrow`
- Process:
  - Iterates through shards.
  - Segments text into sentences and syllables using the configured `word_engine`.
    - Default: `"myword"` (both `PipelineConfig` and CLI)
  - Applies POS tagging using the configured `pos_tagger` (Rule-Based, Viterbi, or Transformer).
- Output: `segmented_corpus.arrow`
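A minimal sketch of how the pluggable `word_engine` and `pos_tagger` might fit together; the `Protocol` interfaces and the whitespace/noun dummies are assumptions, not the real components:

```python
from typing import Iterable, Iterator, Protocol

class WordEngine(Protocol):
    """Assumed interface for a segmentation engine (e.g. the default "myword")."""
    def segment(self, text: str) -> list[str]: ...

class PosTagger(Protocol):
    """Assumed interface for a POS tagger (rule-based, Viterbi, or Transformer)."""
    def tag(self, words: list[str]) -> list[tuple[str, str]]: ...

def segment_shard(lines: Iterable[str], engine: WordEngine,
                  tagger: PosTagger) -> Iterator[list[tuple[str, str]]]:
    """Run one shard through segmentation, then POS tagging."""
    for line in lines:
        yield tagger.tag(engine.segment(line))

class WhitespaceEngine:
    """Dummy stand-in engine: splits on whitespace."""
    def segment(self, text: str) -> list[str]:
        return text.split()

class NounTagger:
    """Dummy stand-in tagger: tags every word as NOUN."""
    def tag(self, words: list[str]) -> list[tuple[str, str]]:
        return [(w, "NOUN") for w in words]
```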
### Frequency Building
- Input: `segmented_corpus.arrow`
- Process:
  - Counts occurrences of syllables, words, bigrams, and trigrams.
  - Calculates POS tag probabilities (unigram/bigram/trigram).
  - Filters out items below `min_frequency`.
- Output: TSV files (e.g., `word_frequencies.tsv`, `bigram_probabilities.tsv`).
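The counting and `min_frequency` filtering can be sketched with `collections.Counter`; the POS probability math and TSV serialization are omitted, and the exact counting details are assumptions:

```python
from collections import Counter

def ngrams(tokens, n):
    """All length-n sliding windows over a token list."""
    return zip(*(tokens[i:] for i in range(n)))

def build_frequencies(sentences, min_frequency=50):
    """Count words, bigrams, and trigrams, then drop rare items."""
    words, bigrams, trigrams = Counter(), Counter(), Counter()
    for toks in sentences:
        words.update(toks)
        bigrams.update(ngrams(toks, 2))
        trigrams.update(ngrams(toks, 3))

    def keep(counts):
        # Filter items that appear fewer than min_frequency times.
        return {k: v for k, v in counts.items() if v >= min_frequency}

    return keep(words), keep(bigrams), keep(trigrams)
```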
## Configuration

The `PipelineConfig` class supports fine-tuning:
| Parameter | Default | Description |
|---|---|---|
| `min_frequency` | `50` | Words appearing fewer times than this are discarded. |
| `num_workers` | `None` (auto-detect at runtime) | Number of parallel processes for ingestion. |
| `batch_size` | `10,000` | Rows per Arrow batch. |
| `disk_space_check_mb` | `51200` | Minimum free disk space required in MB (50 GB). Set to `0` to disable. |
| `keep_intermediate` | `False` | If `True`, temporary files are not deleted after success. |
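The table maps onto a config object along these lines; the field names and defaults come from the table, while the dataclass shape itself is a sketch rather than the actual definition:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineConfig:
    min_frequency: int = 50            # discard words rarer than this
    num_workers: Optional[int] = None  # None = auto-detect at runtime
    batch_size: int = 10_000           # rows per Arrow batch
    disk_space_check_mb: int = 51_200  # 50 GB; set to 0 to disable the check
    keep_intermediate: bool = False    # keep temporary files after success
```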
## Incremental Updates

The pipeline supports incremental updates: new data can be added to an existing database without rebuilding from scratch.

## Curated Lexicon Support
You can mark specific words as trusted/curated (`is_curated=1`) in the database using the `--curated-input` option. The curated input file should include a `word` column header.

Words are assigned the `is_curated` flag as follows:
- Words from the POS seed file → `is_curated=1` (with POS tags)
- Words from the curated lexicon → `is_curated=1`
- Other corpus words → `is_curated=0`
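The flag assignment above reduces to a simple rule; `curated_flag` is a hypothetical helper that illustrates it:

```python
def curated_flag(word: str, pos_seed: set[str], curated_lexicon: set[str]) -> int:
    """is_curated=1 for POS-seed or curated-lexicon words, 0 otherwise."""
    return 1 if word in pos_seed or word in curated_lexicon else 0
```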
Use the `scripts/merge_vocabulary.py` utility to prepare curated lexicons by merging and deduplicating vocabulary files from multiple sources.
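Merging and deduplicating vocabulary files boils down to logic like the sketch below; the real `scripts/merge_vocabulary.py` may differ in file handling and options:

```python
def merge_vocabularies(*word_lists):
    """Merge word lists from multiple sources, deduplicating while
    preserving first-seen order."""
    seen = set()
    merged = []
    for words in word_lists:
        for word in words:
            w = word.strip()
            if w and w not in seen:
                seen.add(w)
                merged.append(w)
    return merged
```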
See the Custom Dictionaries Guide for detailed usage examples.