Performance Tuning

mySpellChecker is fast by default, but a few configuration decisions (caching, validation level, provider choice, and async usage) can make a significant difference in production.

1. Caching

The SQLiteProvider relies on caching to avoid disk I/O.

Setting: provider_config.cache_size (default 1024) via ProviderConfig.
Advice: If memory allows, increase this to 10,000 or more. Frequently used words are then served from the cache instead of the database, which drastically improves lookup speed.
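
To see why a larger cache helps, here is a minimal sketch that puts an LRU cache in front of a SQLite lookup. It is not mySpellChecker's implementation: the `words` table, the `make_lookup` helper, and the use of `functools.lru_cache` are assumptions made for illustration. Only `cache_size` corresponds to the setting above.

```python
import sqlite3
from functools import lru_cache


def make_lookup(conn: sqlite3.Connection, cache_size: int = 1024):
    """Return a word-frequency lookup backed by SQLite with an LRU cache in front."""

    @lru_cache(maxsize=cache_size)
    def lookup(word: str) -> int:
        # Only reached on a cache miss; a hit never touches the database.
        row = conn.execute(
            "SELECT freq FROM words WHERE word = ?", (word,)
        ).fetchone()
        return row[0] if row else 0

    return lookup


# Toy in-memory dictionary standing in for the real database file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, freq INTEGER)")
conn.executemany("INSERT INTO words VALUES (?, ?)", [("hello", 50), ("world", 30)])

lookup = make_lookup(conn, cache_size=10_000)
for word in ["hello", "world", "hello", "hello", "missing"]:
    lookup(word)

info = lookup.cache_info()
print(f"hits={info.hits} misses={info.misses}")
```

Repeated words are answered from memory, so the hit rate (and the benefit of a larger cache) grows with how skewed your word distribution is.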

2. Validation Levels

If you only need to catch typos (invalid characters/syllables) and don’t care about context errors:

Action: Set use_context_checker=False in config.
Result: Skips the N-gram lookup step, so each check does less work and returns faster.

3. Async Execution

For web servers (FastAPI, Django, etc.), always use check_async.

Why: The core check logic is CPU-bound. Running it synchronously blocks the event loop. check_async offloads it to a thread.
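
The effect of offloading can be sketched with the standard library alone. The `check` function below is a hypothetical stand-in for a blocking spell check (simulated with `time.sleep`), and `asyncio.to_thread` is one way to move such a call off the event loop. Whether check_async uses this exact mechanism internally is an assumption.

```python
import asyncio
import time


def check(text: str) -> int:
    """Hypothetical stand-in for a blocking spell check."""
    time.sleep(0.2)  # simulate work that would otherwise block the event loop
    return len(text.split())


async def check_in_thread(text: str) -> int:
    # Run the blocking call in a worker thread and await its result.
    return await asyncio.to_thread(check, text)


async def main() -> tuple[int, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the event loop gets to run while the check is busy.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    result = await check_in_thread("some text to check")
    beat.cancel()
    return result, ticks


result, ticks = asyncio.run(main())
print(f"result={result}, heartbeat ticks during check={ticks}")
```

The heartbeat keeps ticking while the check runs. Calling `check` directly inside `main` would leave the tick count at zero, which is what a blocked event loop looks like to other requests.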

4. Memory vs. Disk

MemoryProvider: If startup time (~5-10s) is acceptable and you have 500MB+ RAM, this is the fastest option.
SQLiteProvider: Instant startup, low RAM. Best for CLIs or limited environments.

5. Batch Processing

Use check_batch(texts) instead of a loop.

check_batch is currently a simple wrapper around per-text checks, but future versions may parallelize it, so code that already calls it will benefit without changes.

6. Data Pipeline Performance (Building Dictionaries)

Building a dictionary from a massive corpus (10GB+) can be CPU- and I/O-intensive.

Sharding: The pipeline automatically shards input for parallel processing. Ensure your machine has multiple cores available.
Disk I/O: Use a fast SSD. Intermediate files (shards) are written to disk to keep RAM usage low.
Word Engine: The default myword engine is a custom rule-based segmenter. For maximum speed during build, ensure the Cython extensions are compiled properly, which speeds up the segmentation loop by ~10x.
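
The shard-then-merge pattern described above can be sketched as follows. This is an illustration, not the pipeline's code: `make_shards`, `count_shard`, and `build_counts` are hypothetical names, tokens are split on whitespace rather than segmented, and a thread pool is used only to keep the sketch self-contained (a real build would more likely use worker processes and on-disk shards).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def make_shards(lines: list[str], num_shards: int) -> list[list[str]]:
    """Distribute lines round-robin into num_shards lists."""
    return [lines[i::num_shards] for i in range(num_shards)]


def count_shard(shard: list[str]) -> Counter:
    """Count whitespace-separated tokens in one shard."""
    counts: Counter = Counter()
    for line in shard:
        counts.update(line.split())
    return counts


def build_counts(lines: list[str], num_shards: int = 4) -> Counter:
    """Count tokens per shard concurrently, then merge the partial counts."""
    shards = make_shards(lines, num_shards)
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        for partial in pool.map(count_shard, shards):
            total.update(partial)
    return total


corpus = ["a b a", "b c", "a", "c c d"]
print(build_counts(corpus, num_shards=2))
```

Because each shard is counted independently and the merge is a simple sum, the result does not depend on the number of shards, only the wall-clock time does.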

7. POS Tagging & Segmentation Performance

Tagger Selection

The choice of POS tagger has a massive impact on throughput:

Tagger	Speed	Description
Rule-Based	⚡⚡⚡ Fast	100K+ words/s. Best for general use and CLIs.
Viterbi	⚡⚡ Medium	~20K words/s. Pure Python. Good balance for CPU-only environments.
Transformer	🐢/⚡ Slow/Fast	~5K words/s (CPU) vs ~50K words/s (GPU). Highest accuracy but heavy resource usage.

Advice: Use transformer only if you have a GPU (device=0) or if accuracy is paramount and latency is secondary.

Joint Segmentation

Enabling joint.enabled=True combines segmentation and tagging into a unified Viterbi path.

Cost: Slower than sequential mode because the state space is larger (Words × Tags).
Beam Width: Controlled by joint.beam_width (default 15).
- Lower (e.g., 5-10): Faster, slightly less accurate.
- Higher (e.g., 20+): Slower, diminishing returns on accuracy.
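
To make the beam-width trade-off concrete, here is a toy beam search over tag sequences. It is a simplified sketch: it covers tagging only (not joint segmentation), the tag set and all probabilities are made up, and `beam_tag` is a hypothetical helper rather than the library's decoder.

```python
import math

TAGS = ["N", "V"]

# Toy log-probabilities; every value here is invented for illustration.
EMIT = {
    ("fish", "N"): math.log(0.6), ("fish", "V"): math.log(0.4),
    ("swim", "N"): math.log(0.2), ("swim", "V"): math.log(0.8),
}
TRANS = {
    ("<s>", "N"): math.log(0.7), ("<s>", "V"): math.log(0.3),
    ("N", "N"): math.log(0.3), ("N", "V"): math.log(0.7),
    ("V", "N"): math.log(0.6), ("V", "V"): math.log(0.4),
}


def beam_tag(words, beam_width):
    """Return (log_score, tags) for the best path kept by a beam of the given width."""
    beam = [(0.0, ["<s>"])]
    for word in words:
        # Extend every surviving path with every tag.
        candidates = [
            (score + TRANS[(path[-1], tag)] + EMIT[(word, tag)], path + [tag])
            for score, path in beam
            for tag in TAGS
        ]
        # Keep only the beam_width highest-scoring paths.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    score, path = beam[0]
    return score, path[1:]


for width in (1, 4):
    score, tags = beam_tag(["fish", "swim"], beam_width=width)
    print(width, tags, round(math.exp(score), 4))
```

Each step scores `len(beam) * len(TAGS)` candidates, so the work grows roughly linearly with the beam width, while a width of 1 degenerates to greedy decoding.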

8. Connection Pool Configuration

For high-concurrency scenarios, tuning the connection pool is essential.

Pool Sizing

from myspellchecker.core.config.validation_configs import ConnectionPoolConfig

pool_config = ConnectionPoolConfig(
    min_size=2,      # Minimum connections to maintain
    max_size=10,     # Maximum connections allowed
    timeout=5.0,     # Checkout timeout in seconds
)

Setting	Low Traffic	High Traffic	Web Server
`min_size`	1-2	2-5	5-10
`max_size`	5	10-20	20-50
`timeout`	5.0	10.0	15.0
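
How the three settings interact can be sketched with a small bounded pool built on `queue.Queue`. `SimplePool` is a hypothetical class written for this illustration and is not the library's pool: it pre-opens `min_size` connections, grows on demand up to `max_size`, and makes callers wait up to `timeout` seconds once the limit is reached.

```python
import queue
import sqlite3
import threading


class SimplePool:
    """Bounded SQLite connection pool (illustrative only)."""

    def __init__(self, database, min_size=2, max_size=10, timeout=5.0):
        self._database = database
        self._max_size = max_size
        self._timeout = timeout
        self._idle = queue.Queue()
        self._created = 0
        self._lock = threading.Lock()
        for _ in range(min_size):  # pre-open min_size connections
            self._created += 1
            self._idle.put(self._connect())

    def _connect(self):
        # check_same_thread=False lets a connection be handed between threads.
        return sqlite3.connect(self._database, check_same_thread=False)

    def acquire(self):
        try:
            return self._idle.get_nowait()  # reuse an idle connection
        except queue.Empty:
            pass
        with self._lock:
            grow = self._created < self._max_size
            if grow:
                self._created += 1
        if grow:
            return self._connect()
        try:
            # Pool is at max_size: wait up to `timeout` seconds for a release.
            return self._idle.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("no connection available within timeout") from None

    def release(self, conn):
        self._idle.put(conn)


pool = SimplePool(":memory:", min_size=1, max_size=2, timeout=0.05)
first, second = pool.acquire(), pool.acquire()
try:
    pool.acquire()
except TimeoutError as exc:
    print("third checkout failed:", exc)
pool.release(first)
```

A larger `max_size` reduces checkout timeouts under load at the cost of more open connections, while `min_size` only affects how many connections are ready before the first request arrives.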

SQLite Timeout

Handle database lock contention with sqlite_timeout:

pool_config = ConnectionPoolConfig(
    sqlite_timeout=30.0,  # Wait up to 30s for database lock
)

Default: 30 seconds
High contention: Increase to 60-120 seconds
Low contention: Decrease to 5-10 seconds for faster failure
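
The "faster failure" behaviour can be reproduced with the standard `sqlite3` module, whose `connect()` accepts a `timeout` in seconds (default 5.0). This sketch assumes that sqlite_timeout is passed through to the same underlying lock-wait setting, which this page does not confirm. The file name and table are made up.

```python
import os
import sqlite3
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Writer: autocommit mode so that BEGIN EXCLUSIVE can be issued explicitly.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("CREATE TABLE words (word TEXT)")
writer.execute("BEGIN EXCLUSIVE")  # hold the database lock

# Reader: give up after 0.2 seconds instead of waiting for the default timeout.
reader = sqlite3.connect(path, timeout=0.2)
start = time.monotonic()
try:
    reader.execute("SELECT count(*) FROM words").fetchone()
    outcome = "ok"
except sqlite3.OperationalError as exc:
    outcome = str(exc)
waited = time.monotonic() - start

writer.execute("ROLLBACK")  # release the lock
print(f"outcome={outcome!r}, waited about {waited:.2f}s")
```

A short timeout surfaces contention quickly so the caller can retry or shed load. A long timeout hides brief lock spikes but ties up the calling thread while it waits.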

Connection Age

Refresh stale connections to prevent memory leaks:

pool_config = ConnectionPoolConfig(
    max_connection_age=3600.0,  # Recreate connections after 1 hour
)

9. N-gram Context Checker Performance

Smoothing Strategy

Choose smoothing based on your data:

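# NgramContextChecker is assumed to live in the same module as SmoothingStrategy.
from myspellchecker.algorithms.ngram_context_checker import (
    NgramContextChecker,
    SmoothingStrategy,
)

# `provider` is an already constructed dictionary provider (e.g. SQLiteProvider).

# Stupid Backoff (default) - fast and effective
checker = NgramContextChecker(
    provider=provider,
    smoothing_strategy=SmoothingStrategy.STUPID_BACKOFF,
)

# No smoothing - fastest (for pre-smoothed data)
checker = NgramContextChecker(
    provider=provider,
    smoothing_strategy=SmoothingStrategy.NONE,
)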
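
For readers unfamiliar with Stupid Backoff, the scoring rule is short enough to sketch in full: use the relative frequency of the n-gram if it was seen, otherwise back off to a shorter context and multiply by a fixed factor. The version below is a generic illustration, not the library's code. The backoff factor 0.4 is the value commonly used in the literature, and the factor mySpellChecker uses is not stated on this page.

```python
from collections import Counter

ALPHA = 0.4  # commonly used backoff factor (assumption, not the library's setting)


def stupid_backoff(word, context, counts, total_unigrams):
    """Score `word` given a tuple of preceding words; scores are not probabilities."""
    if not context:
        return counts[(word,)] / total_unigrams
    ngram = context + (word,)
    if counts[ngram] > 0:
        # Seen n-gram: plain relative frequency, no normalization.
        return counts[ngram] / counts[context]
    # Unseen n-gram: drop the oldest context word and discount by ALPHA.
    return ALPHA * stupid_backoff(word, context[1:], counts, total_unigrams)


tokens = "the cat sat on the mat".split()
counts = Counter()
for n in (1, 2):
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1

seen = stupid_backoff("cat", ("the",), counts, len(tokens))    # bigram was seen
unseen = stupid_backoff("sat", ("the",), counts, len(tokens))  # backs off to unigram
print(seen, unseen)
```

The method is fast because it needs only count lookups and a multiplication, with no normalization pass over the vocabulary.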

10. Edit Distance Performance

The library uses Myanmar-specific weighted edit distance for better accuracy. For maximum speed:

- Ensure the Cython extensions are compiled (python setup.py build_ext --inplace).
- The Cython version is ~10x faster than the pure-Python fallback.
- Use damerau_levenshtein_distance for integer distances (fastest).
- Use weighted_damerau_levenshtein_distance for float distances (more accurate).
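
For reference, an unweighted Damerau-Levenshtein distance can be sketched in pure Python as below. This is the optimal string alignment variant with unit costs. Whether the library's damerau_levenshtein_distance uses this variant is an assumption, and `osa_distance` is a name chosen for this example. The nested loops are also the kind of code that benefits most from Cython compilation.

```python
def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant, unit costs)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


print(osa_distance("ab", "ba"))
print(osa_distance("kitten", "sitting"))
```

A weighted variant replaces the unit costs with per-character-pair floats (for example, a lower cost for commonly confused characters), which is why it returns float distances and does slightly more work per cell.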
