mySpellChecker is fast by default, but a few configuration decisions (caching, validation level, provider choice, and async usage) can make a significant difference in production.

1. Caching

The SQLiteProvider relies on caching to avoid disk I/O.
  • Setting: provider_config.cache_size (default 1024) via ProviderConfig.
  • Advice: If you have RAM, increase this to 10,000 or more. This drastically improves speed for common words.
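As a sketch (the top-level import path and constructor shape are assumptions here; check your version's API), raising the cache size might look like:

```python
# Hypothetical import path and constructor; adjust to your version.
from myspellchecker import ProviderConfig, SQLiteProvider

provider = SQLiteProvider(
    config=ProviderConfig(cache_size=10_000),  # default is 1024
)
```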

2. Validation Levels

If you only need to catch typos (invalid characters/syllables) and don’t care about context errors:
  • Action: Set use_context_checker=False in config.
  • Result: Removes the N-gram lookup step, providing a speed boost.
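A hedged sketch of disabling the context checker (the SpellChecker and SpellCheckerConfig names are illustrative assumptions; only the use_context_checker flag comes from the docs above):

```python
# Hypothetical class names; substitute your version's config surface.
from myspellchecker import SpellChecker, SpellCheckerConfig

checker = SpellChecker(
    config=SpellCheckerConfig(use_context_checker=False),  # skip the N-gram step
)
```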

3. Async Execution

For web servers (FastAPI, Django, etc.), always use check_async.
  • Why: The core check logic is CPU-bound. Running it synchronously blocks the event loop. check_async offloads it to a thread.
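To see why this matters, here is a self-contained sketch (the function names are stand-ins, not the library's internals) of offloading a CPU-bound check to a worker thread, which is the pattern check_async follows:

```python
import asyncio
import time

def cpu_bound_check(text: str) -> list[str]:
    # Stand-in for the synchronous core check (the real call would be
    # something like checker.check(text), which is CPU-bound).
    time.sleep(0.05)  # simulate heavy computation
    return []         # pretend no errors were found

async def handler(text: str) -> list[str]:
    # Running the synchronous check directly would block the event loop;
    # delegating it to a thread keeps other requests flowing.
    return await asyncio.to_thread(cpu_bound_check, text)

async def main() -> None:
    # Four concurrent "requests" overlap their blocking work instead of
    # serializing on the event loop.
    results = await asyncio.gather(*(handler("text") for _ in range(4)))
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```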

4. Memory vs. Disk

  • MemoryProvider: If startup time (~5-10s) is acceptable and you have 500MB+ RAM, this is the fastest option.
  • SQLiteProvider: Instant startup, low RAM. Best for CLIs or limited environments.
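A sketch of the choice (the import path is an assumption; the provider names come from the docs above):

```python
# Hypothetical import path; adjust to your version.
from myspellchecker import MemoryProvider, SQLiteProvider

# Long-running server with RAM to spare: pay the startup cost once.
provider = MemoryProvider()

# CLI or constrained environment: instant startup, low RAM.
provider = SQLiteProvider()
```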

5. Batch Processing

Use check_batch(texts) instead of a loop.
  • While currently a wrapper, future versions may parallelize this operation.
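To illustrate why a batch entry point matters even while it is a wrapper, here is a self-contained sketch (not the library's implementation) of how a future version could parallelize over a thread pool without callers changing any code:

```python
from concurrent.futures import ThreadPoolExecutor

def check(text: str) -> list[str]:
    # Stand-in for the real per-text check.
    return []

def check_batch(texts: list[str]) -> list[list[str]]:
    # Hypothetical parallel implementation: callers already on the batch
    # API would get this speedup for free.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(check, texts))

print(check_batch(["one", "two", "three"]))  # -> [[], [], []]
```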

6. Data Pipeline Performance (Building Dictionaries)

Building a dictionary from a massive corpus (10GB+) can be intensive.
  • Sharding: The pipeline automatically shards input for parallel processing. Ensure your machine has multiple cores available.
  • Disk I/O: Use a fast SSD. Intermediate files (shards) are written to disk to keep RAM usage low.
  • Word Engine: The default myword engine is a custom rule-based segmenter. For maximum build speed, ensure the Cython extensions are compiled; they speed up the segmentation loop by ~10x.

7. POS Tagging & Segmentation Performance

Tagger Selection

The choice of POS tagger has a massive impact on throughput:
  • Rule-Based (fast): 100K+ words/s. Best for general use and CLIs.
  • Viterbi (medium): ~20K words/s. Pure Python. Good balance for CPU-only environments.
  • Transformer (slow on CPU, fast on GPU): ~5K words/s on CPU vs ~50K words/s on GPU. Highest accuracy but heavy resource usage.
Advice: Use transformer only if you have a GPU (device=0) or if accuracy is paramount and latency is secondary.
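As an illustrative sketch only (the configuration keys below are assumptions, not the library's confirmed API; only device=0 appears in the advice above), the selection might be expressed as:

```python
# Hypothetical configuration keys; check your version's tagger API.
tagger_config = {
    "tagger": "rule_based",  # fastest; good default for CLIs
    # "tagger": "viterbi",   # pure-Python middle ground for CPU-only hosts
    # "tagger": "transformer", "device": 0,  # device=0 = first GPU
}
```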

Joint Segmentation

Enabling joint.enabled=True combines segmentation and tagging into a unified Viterbi path.
  • Cost: Slower than sequential mode because the state space is larger (Words × Tags).
  • Beam Width: Controlled by joint.beam_width (default 15).
    • Lower (e.g., 5-10): Faster, slightly less accurate.
    • Higher (e.g., 20+): Slower, diminishing returns on accuracy.
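The beam-width trade-off can be seen in a generic beam-search step (a toy sketch, not the library's decoder): each step expands every hypothesis over the joint Words × Tags space, then keeps only the top beam_width candidates.

```python
def beam_step(hypotheses, expansions, beam_width=15):
    # hypotheses: list of (score, path); expansions: list of (label, delta).
    # Expansion multiplies the candidate count, so pruning to beam_width
    # bounds the work per step; a smaller beam is faster but can discard
    # the path that would have won later.
    candidates = [
        (score + delta, path + [label])
        for score, path in hypotheses
        for label, delta in expansions
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:beam_width]

step = beam_step([(0.0, [])], [("N", -0.1), ("V", -0.7)], beam_width=1)
print(step)  # keeps only the best-scoring hypothesis
```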

8. Connection Pool Configuration

For high-concurrency scenarios, tuning the connection pool is essential.

Pool Sizing

from myspellchecker.core.config.validation_configs import ConnectionPoolConfig

pool_config = ConnectionPoolConfig(
    min_size=2,      # Minimum connections to maintain
    max_size=10,     # Maximum connections allowed
    timeout=5.0,     # Checkout timeout in seconds
)
  • min_size: 1-2 (low traffic), 2-5 (high traffic), 5-10 (web server)
  • max_size: 5 (low traffic), 10-20 (high traffic), 20-50 (web server)
  • timeout: 5.0 (low traffic), 10.0 (high traffic), 15.0 (web server)

SQLite Timeout

Handle database lock contention with sqlite_timeout:
pool_config = ConnectionPoolConfig(
    sqlite_timeout=30.0,  # Wait up to 30s for database lock
)
  • Default: 30 seconds
  • High contention: Increase to 60-120 seconds
  • Low contention: Decrease to 5-10 seconds for faster failure

Connection Age

Refresh stale connections to prevent memory leaks:
pool_config = ConnectionPoolConfig(
    max_connection_age=3600.0,  # Recreate connections after 1 hour
)

9. N-gram Context Checker Performance

Smoothing Strategy

Choose smoothing based on your data:
from myspellchecker.algorithms.ngram_context_checker import (
    NgramContextChecker,
    SmoothingStrategy,
)

# Stupid Backoff (default) - fast and effective
checker = NgramContextChecker(
    provider=provider,
    smoothing_strategy=SmoothingStrategy.STUPID_BACKOFF,
)

# No smoothing - fastest (for pre-smoothed data)
checker = NgramContextChecker(
    provider=provider,
    smoothing_strategy=SmoothingStrategy.NONE,
)

10. Edit Distance Performance

The library uses Myanmar-specific weighted edit distance for better accuracy. For maximum speed:
  • Ensure Cython extensions are compiled (python setup.py build_ext --inplace)
  • Cython version is ~10x faster than pure Python fallback
  • Use damerau_levenshtein_distance for integer distances (fastest)
  • Use weighted_damerau_levenshtein_distance for float distances (more accurate)
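For reference, the integer variant computes the restricted Damerau-Levenshtein distance. A minimal pure-Python version (illustrative only; the library's Cython build is ~10x faster, and its weighted variant additionally applies Myanmar-specific costs) looks like:

```python
def damerau_levenshtein_distance(a: str, b: str) -> int:
    # Restricted Damerau-Levenshtein: edit distance with insertions,
    # deletions, substitutions, and adjacent transpositions, each cost 1.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein_distance("abcd", "acbd"))  # adjacent swap -> 1
```

A weighted variant follows the same recurrence but returns floats, with per-operation costs reflecting how likely each confusion is in Myanmar text.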