1. Caching
The `SQLiteProvider` relies on caching to avoid disk I/O.
- Setting: `provider_config.cache_size` (default 1024) via `ProviderConfig`.
- Advice: If you have RAM to spare, increase this to 10,000 or more. This drastically improves speed for common words.
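
To illustrate why a larger cache helps, the sketch below puts an in-memory LRU cache in front of a SQLite lookup and reports hits and misses. It is not the library's implementation: the `words` table, `make_lookup`, and the sample data are hypothetical, and only the `cache_size` name mirrors the setting described above.

```python
import sqlite3
from functools import lru_cache


def make_lookup(conn: sqlite3.Connection, cache_size: int):
    """Return a word-frequency lookup backed by an LRU cache of `cache_size` entries."""

    @lru_cache(maxsize=cache_size)
    def lookup(word: str) -> int:
        # Only reached on a cache miss; a hit never touches the database.
        row = conn.execute("SELECT freq FROM words WHERE word = ?", (word,)).fetchone()
        return row[0] if row else 0

    return lookup


# Hypothetical dictionary table, kept in memory so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, freq INTEGER)")
conn.executemany("INSERT INTO words VALUES (?, ?)", [("a", 10), ("b", 5), ("c", 1)])

lookup = make_lookup(conn, cache_size=2)
for word in ["a", "b", "a", "a", "c", "a"]:
    lookup(word)

stats = lookup.cache_info()
print(f"hits={stats.hits} misses={stats.misses}")
```

Frequently used words stay in the cache while rare ones are evicted, so a cache large enough to hold the common vocabulary avoids most database reads.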
2. Validation Levels
If you only need to catch typos (invalid characters/syllables) and don't care about context errors:
- Action: Set `use_context_checker=False` in config.
- Result: Removes the N-gram lookup step, providing a speed boost.
3. Async Execution
For web servers (FastAPI, Django, etc.), always use `check_async`.
- Why: The core check logic is CPU-bound. Running it synchronously blocks the event loop. `check_async` offloads it to a thread.
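
The offloading pattern can be sketched with the standard library alone. The `check` function here is a hypothetical stand-in for a blocking spell check, and `check_in_thread` is only an assumption about what a thread-offloading wrapper might look like, not the library's `check_async`.

```python
import asyncio
import time


def check(text: str) -> list[str]:
    """Hypothetical stand-in for a blocking, synchronous spell check."""
    time.sleep(0.05)  # simulates work that would otherwise stall the event loop
    return [w for w in text.split() if w.endswith("x")]


async def check_in_thread(text: str) -> list[str]:
    # asyncio.to_thread runs the blocking call in a worker thread, so the
    # event loop stays free to serve other requests in the meantime.
    return await asyncio.to_thread(check, text)


async def main() -> list[list[str]]:
    # Both checks are in flight at once; gather returns results in call order.
    return await asyncio.gather(check_in_thread("good badx"), check_in_thread("fine"))


results = asyncio.run(main())
print(results)
```

In a FastAPI or similar handler, the `await` on the wrapper is what keeps other requests moving while one check is running.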
4. Memory vs. Disk
- `MemoryProvider`: If startup time (~5-10s) is acceptable and you have 500MB+ RAM, this is the fastest option.
- `SQLiteProvider`: Instant startup, low RAM. Best for CLIs or limited environments.
5. Batch Processing
Use `check_batch(texts)` instead of a loop.
- It is currently a thin wrapper around per-text checks, but future versions may parallelize it, so code that already calls it would benefit without changes.
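
The sketch below shows why calling a batch function is the safer habit: a sequential wrapper and a pooled version can share one signature and return identical results. Every name here (`check`, `check_batch_sequential`, `check_batch_parallel`) is hypothetical, and the thread pool is just one way such an operation might be parallelized.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def check(text: str) -> int:
    """Hypothetical stand-in for a single-text check; returns a token count."""
    return len(text.split())


def check_batch_sequential(texts: list[str], fn: Callable[[str], int] = check) -> list[int]:
    # What a thin wrapper amounts to: the same loop, behind one call.
    return [fn(t) for t in texts]


def check_batch_parallel(
    texts: list[str], fn: Callable[[str], int] = check, workers: int = 4
) -> list[int]:
    # Same signature, with the work spread over a pool.
    # Executor.map preserves input order, so callers see identical results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, texts))


texts = ["one two", "three", "four five six"]
sequential = check_batch_sequential(texts)
parallel = check_batch_parallel(texts)
print(sequential == parallel)
```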
6. Data Pipeline Performance (Building Dictionaries)
Building a dictionary from a massive corpus (10GB+) can be intensive.
- Sharding: The pipeline automatically shards input for parallel processing. Ensure your machine has multiple cores available.
- Disk I/O: Use a fast SSD. Intermediate files (shards) are written to disk to keep RAM usage low.
- Word Engine: The default `myword` engine is a custom rule-based segmenter. For maximum speed during build, ensure the Cython extensions are compiled properly, which speeds up the segmentation loop by ~10x.
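
The shard-then-merge idea can be sketched as follows. This is a simplified illustration under assumed names (`write_shards`, `count_shard`) with whitespace tokenization; it does not reproduce the pipeline's actual stages or file layout.

```python
import tempfile
from collections import Counter
from pathlib import Path


def write_shards(lines: list[str], num_shards: int, out_dir: Path) -> list[Path]:
    """Distribute lines round-robin into shard files on disk."""
    paths = [out_dir / f"shard_{i}.txt" for i in range(num_shards)]
    handles = [p.open("w", encoding="utf-8") for p in paths]
    try:
        for i, line in enumerate(lines):
            handles[i % num_shards].write(line + "\n")
    finally:
        for h in handles:
            h.close()
    return paths


def count_shard(path: Path) -> Counter:
    """Count whitespace-separated tokens in one shard, reading line by line."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts


corpus = ["a b a", "b c", "a", "c c c"]
with tempfile.TemporaryDirectory() as tmp:
    shards = write_shards(corpus, num_shards=2, out_dir=Path(tmp))
    # Each shard is independent, so this map could instead be handed to
    # concurrent.futures.ProcessPoolExecutor.map on a multi-core machine.
    total = sum(map(count_shard, shards), Counter())

print(total.most_common())
```

Because each worker holds only one shard's counts at a time, memory use depends on shard size rather than corpus size, which is why fast disk matters more than RAM here.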
7. POS Tagging & Segmentation Performance
Tagger Selection
The choice of POS tagger has a massive impact on throughput:
| Tagger | Speed | Description |
|---|---|---|
| Rule-Based | ⚡⚡⚡ Fast | 100K+ words/s. Best for general use and CLIs. |
| Viterbi | ⚡⚡ Medium | ~20K words/s. Pure Python. Good balance for CPU-only environments. |
| Transformer | 🐢/⚡ Slow/Fast | ~5K words/s (CPU) vs ~50K words/s (GPU). Highest accuracy but heavy resource usage. |
Use `transformer` only if you have a GPU (`device=0`) or if accuracy is paramount and latency is secondary.
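
To make the throughput figures concrete, this back-of-the-envelope sketch converts them into tagging time for one million words. The dictionary keys are illustrative labels rather than configuration values, and the rule-based figure is a lower bound on speed, so its time is an upper bound.

```python
# Approximate throughput figures from the table above (words per second).
THROUGHPUT = {
    "rule_based": 100_000,
    "viterbi": 20_000,
    "transformer_cpu": 5_000,
    "transformer_gpu": 50_000,
}


def seconds_to_tag(num_words: int, tagger: str) -> float:
    """Estimate wall-clock seconds to tag `num_words` at the listed throughput."""
    return num_words / THROUGHPUT[tagger]


for name in THROUGHPUT:
    print(f"{name:16s} {seconds_to_tag(1_000_000, name):7.1f} s per 1M words")
```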
Joint Segmentation
Enabling `joint.enabled=True` combines segmentation and tagging into a unified Viterbi path.
- Cost: Slower than sequential mode because the state space is larger (Words × Tags).
- Beam Width: Controlled by `joint.beam_width` (default 15).
  - Lower (e.g., 5-10): Faster, slightly less accurate.
  - Higher (e.g., 20+): Slower, diminishing returns on accuracy.
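
The speed/accuracy trade-off of the beam width can be seen in a generic beam search. The decoder below is a toy written for this illustration, not the library's joint decoder; the labels and probabilities are made up so that a beam of 1 prunes the path that turns out to be best.

```python
import heapq
import math


def beam_decode(emissions, transition, beam_width):
    """Decode a label path, keeping at most `beam_width` hypotheses per step.

    emissions: list of dicts mapping label -> log-probability at each position.
    transition: function (prev_label, label) -> log-probability.
    Returns (log score, path) for the best surviving hypothesis.
    """
    beam = [(0.0, [])]  # (cumulative log score, path so far)
    for scores in emissions:
        candidates = []
        for total, path in beam:
            prev = path[-1] if path else None
            for label, emit in scores.items():
                step = emit + (transition(prev, label) if prev is not None else 0.0)
                candidates.append((total + step, path + [label]))
        # Pruning step: a smaller beam_width means fewer hypotheses to expand.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])


# Made-up scores: "A" looks better at first, but "B" leads to the better path.
emissions = [{"A": math.log(0.6), "B": math.log(0.4)}, {"X": math.log(1.0)}]
TRANS = {("A", "X"): math.log(0.1), ("B", "X"): math.log(0.9)}


def transition(prev, label):
    return TRANS[(prev, label)]


narrow = beam_decode(emissions, transition, beam_width=1)
wide = beam_decode(emissions, transition, beam_width=2)
print(narrow[1], wide[1])
```

Work per step grows with the number of hypotheses kept, which is why widening the beam costs time while the accuracy gain shrinks once the best path is reliably inside the beam.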
8. Connection Pool Configuration
For high-concurrency scenarios, tuning the connection pool is essential.
Pool Sizing
| Setting | Low Traffic | High Traffic | Web Server |
|---|---|---|---|
| `min_size` | 1-2 | 2-5 | 5-10 |
| `max_size` | 5 | 10-20 | 20-50 |
| `timeout` | 5.0 | 10.0 | 15.0 |
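
How the three settings interact can be sketched with a minimal pool built on `queue.Queue`. `SimplePool` is a hypothetical class written for this illustration and is not the library's pool; it assumes `timeout` is measured in seconds and bounds how long a caller waits for a free connection.

```python
import queue
import sqlite3
import threading


class SimplePool:
    """Illustrative pool: pre-opens min_size connections, grows up to max_size."""

    def __init__(self, database: str, min_size: int, max_size: int, timeout: float):
        self._database = database
        self._max_size = max_size
        self._timeout = timeout
        self._idle: queue.Queue = queue.Queue()
        self._lock = threading.Lock()
        self._opened = 0
        for _ in range(min_size):
            self._idle.put(self._open())

    def _open(self) -> sqlite3.Connection:
        self._opened += 1
        # check_same_thread=False lets a connection move between worker threads.
        return sqlite3.connect(self._database, check_same_thread=False)

    def acquire(self) -> sqlite3.Connection:
        try:
            return self._idle.get_nowait()  # reuse an idle connection if any
        except queue.Empty:
            pass
        with self._lock:
            if self._opened < self._max_size:
                return self._open()  # grow the pool up to max_size
        # Pool exhausted: wait up to `timeout` seconds, then raise queue.Empty.
        return self._idle.get(timeout=self._timeout)

    def release(self, conn: sqlite3.Connection) -> None:
        self._idle.put(conn)


pool = SimplePool(":memory:", min_size=1, max_size=2, timeout=0.1)
a = pool.acquire()
b = pool.acquire()
try:
    pool.acquire()
    exhausted = False
except queue.Empty:
    exhausted = True  # a third caller timed out waiting
pool.release(a)
print(exhausted)
```

A larger `min_size` avoids connection setup on the first requests, `max_size` caps resource use, and `timeout` decides whether overloaded callers queue or fail.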
SQLite Timeout
Handle database lock contention with `sqlite_timeout`:
- Default: 30 seconds
- High contention: Increase to 60-120 seconds
- Low contention: Decrease to 5-10 seconds for faster failure
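
The effect of a short timeout can be reproduced with Python's standard `sqlite3` module, whose `connect(..., timeout=...)` argument sets how long a connection waits on a locked database. That `sqlite_timeout` is passed through to this argument is an assumption; the sketch only demonstrates the underlying SQLite behaviour.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT statements below are passed through unchanged.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("BEGIN EXCLUSIVE")  # hold the database lock

# `timeout` is how long sqlite3 waits on a locked database before giving up.
contender = sqlite3.connect(path, timeout=0.2, isolation_level=None)
try:
    contender.execute("INSERT INTO t VALUES (1)")
    outcome = "written"
except sqlite3.OperationalError as exc:
    outcome = f"failed fast: {exc}"

writer.execute("COMMIT")  # release the lock
contender.execute("INSERT INTO t VALUES (1)")  # succeeds once the lock is free
rows = contender.execute("SELECT COUNT(*) FROM t").fetchone()[0]

writer.close()
contender.close()
print(outcome, rows)
```

A long timeout lets writers queue behind each other under contention, while a short one surfaces the lock as an error quickly so the caller can retry or report it.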
Connection Age
Refresh stale connections to prevent memory leaks.
9. N-gram Context Checker Performance
Smoothing Strategy
Choose smoothing based on your data.
10. Edit Distance Performance
The library uses Myanmar-specific weighted edit distance for better accuracy. For maximum speed:
- Ensure Cython extensions are compiled (`python setup.py build_ext --inplace`)
- Cython version is ~10x faster than pure Python fallback
- Use `damerau_levenshtein_distance` for integer distances (fastest)
- Use `weighted_damerau_levenshtein_distance` for float distances (more accurate)
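
For reference, an integer Damerau-Levenshtein distance can be written in a few lines of pure Python. The function below is the optimal-string-alignment variant with unit costs, given under the name `osa_distance` to make clear it is a textbook sketch and not the library's `damerau_levenshtein_distance` or its weighted counterpart.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal-string-alignment variant of Damerau-Levenshtein (unit costs)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                # adjacent transposition counts as a single edit
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("abcd", "acbd"))
print(osa_distance("kitten", "sitting"))
```

The nested loop over both strings is the hot path that a compiled extension accelerates; a weighted variant would replace the unit costs with per-character floats at the price of slower arithmetic.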