Overview
mySpellChecker uses Cython for performance-critical operations:

| Module | Purpose | Performance Gain |
|---|---|---|
| `batch_processor.pyx` | Parallel batch processing | 5-10x |
| `frequency_counter.pyx` | Fast frequency calculations | 3-5x |
| `normalize_c.pyx` | Text normalization | 2-3x |
| `edit_distance_c.pyx` | Levenshtein distance | 4-6x |
| `viterbi.pyx` | Viterbi algorithm for POS | 3-4x |
| `word_segment.pyx` | Word segmentation | 2-3x |
| `mmap_reader.pyx` | Memory-mapped file reading | 2-4x |
| `syllable_rules_c.pyx` | Syllable validation rules | 2-3x |
| `ingester_c.pyx` | Corpus ingestion | 2-3x |
| `repair_c.pyx` | Segmentation repair | 2-3x |
| `tsv_reader_c.pyx` | TSV file reading | 2-3x |
Prerequisites
Required Tools
Verify Installation
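As a minimal sketch, the snippet below checks for the two things a Cython build needs: an importable `Cython` package and a C++ compiler on `PATH`. The helper name `check_build_tools` and the list of compiler names are assumptions for illustration, not part of mySpellChecker.

```python
import importlib.util
import shutil


def check_build_tools(compilers=("c++", "g++", "clang++", "cl")):
    """Report whether Cython and a C++ compiler appear to be available.

    Hypothetical helper; the compiler names are common defaults, not a
    list taken from the project's setup.py.
    """
    cython_found = importlib.util.find_spec("Cython") is not None
    # shutil.which returns the full path of the executable, or None.
    compiler_path = next(
        (path for path in map(shutil.which, compilers) if path), None
    )
    return {"cython": cython_found, "compiler": compiler_path}


if __name__ == "__main__":
    for tool, status in check_build_tools().items():
        print(f"{tool}: {status or 'NOT FOUND'}")
```

Finding a compiler on `PATH` does not prove it can build the extensions, so treat a passing check as a necessary condition only.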
Project Structure
File Types
| Extension | Purpose | Git Tracked? |
|---|---|---|
| `.pyx` | Cython source code | Yes |
| `.pxd` | C-level declarations (like C headers) | Yes |
| `.py` | Python wrapper/fallback | Yes |
| `.cpp` | Generated C++ code | No |
| `.so` / `.pyd` | Compiled binary | No |
Building Cython Extensions
Development Build
Build with Debug Symbols
Build Options
The `setup.py` automatically detects:
- OpenMP availability (macOS requires `brew install libomp`)
- C++ compiler capabilities
- Platform-specific flags
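To make the platform-specific part concrete, here is a sketch of how OpenMP flags typically differ across platforms. The function `openmp_flags` is hypothetical. The project's `setup.py` may instead probe the compiler directly, so read this as an illustration of the flags involved, not of the actual detection logic.

```python
import sys


def openmp_flags(platform=None):
    """Return (compile_args, link_args) commonly used to enable OpenMP.

    Hypothetical helper: the real setup.py may probe the compiler instead
    of switching on the platform name.
    """
    platform = platform or sys.platform
    if platform == "darwin":
        # Apple clang ships no OpenMP runtime; libomp from Homebrew
        # supplies it, and -Xpreprocessor passes -fopenmp through.
        return ["-Xpreprocessor", "-fopenmp"], ["-lomp"]
    if platform == "win32":
        # MSVC enables OpenMP with a compile switch and needs no link flag.
        return ["/openmp"], []
    # GCC and clang on Linux take the same flag when compiling and linking.
    return ["-fopenmp"], ["-fopenmp"]
```

The two lists would be passed as `extra_compile_args` and `extra_link_args` on a setuptools `Extension`. On macOS, include and library paths for the Homebrew `libomp` install may also be required.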
Writing Cython Code
Basic Pattern
Creating .pxd Files
Cross-Module Imports
Import Pattern
The core `normalize.py` module imports directly from the Cython extension
without pure Python fallbacks:
Note: Unlike some other modules that use try/except ImportError fallbacks,
normalize.py requires the Cython extension. For systems without a C++ compiler,
install from a pre-built wheel.
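For contrast, the `try`/`except ImportError` fallback mentioned in the note can be sketched as follows. The module path `hypothetical_pkg.edit_distance_c` and the attribute name `levenshtein` are placeholders, not the project's real import path or API. The pure Python function stands in for whatever fallback a module provides.

```python
import importlib


def levenshtein_py(a, b):
    """Pure Python Levenshtein distance, used when no extension is built."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def load_levenshtein(module_name="hypothetical_pkg.edit_distance_c"):
    """Prefer the compiled function; fall back to pure Python if missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return levenshtein_py
    return module.levenshtein


levenshtein = load_levenshtein()
print(levenshtein("kitten", "sitting"))
```

A module that imports its extension directly, as `normalize.py` does, skips the `try`/`except` and fails at import time when the binary is missing.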
OpenMP Integration
For parallel processing (used in `batch_processor.pyx`):
macOS OpenMP Setup
Testing Cython Code
Unit Tests
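One way to keep the compiled and pure Python paths in agreement is to run the same assertions against every backend that can be imported. The sketch below assumes a hypothetical module `hypothetical_pkg.frequency_counter` exposing `count_frequencies`. Neither name comes from the project, and the Python function is only a reference stand-in.

```python
import importlib
import unittest
from collections import Counter


def count_frequencies_py(words):
    """Pure Python reference implementation."""
    return dict(Counter(words))


def available_backends():
    """Collect every implementation that can be imported on this machine."""
    backends = {"python": count_frequencies_py}
    try:
        # Hypothetical module path and function name.
        ext = importlib.import_module("hypothetical_pkg.frequency_counter")
        backends["cython"] = ext.count_frequencies
    except ImportError:
        pass  # Extension not built; only the Python path is tested.
    return backends


class FrequencyCounterTests(unittest.TestCase):
    def test_counts_match_on_all_backends(self):
        words = ["a", "b", "a", "c", "a"]
        for name, count in available_backends().items():
            with self.subTest(backend=name):
                self.assertEqual(count(words), {"a": 3, "b": 1, "c": 1})

    def test_empty_input(self):
        for name, count in available_backends().items():
            with self.subTest(backend=name):
                self.assertEqual(count([]), {})
```

Saved as a test module, this runs under `python -m unittest`. `subTest` reports which backend failed without stopping at the first mismatch.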
Benchmark Tests
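Speedup figures such as those in the overview table are only meaningful when both implementations are timed on the same input. A minimal sketch using the standard `timeit` module follows. The helpers `benchmark` and `speedup` are hypothetical, and `slow_total` versus the built-in `sum` stand in for a pure Python function and its compiled counterpart.

```python
import timeit


def benchmark(func, *args, repeat=5, number=1000):
    """Return the best per-call time in seconds over several runs."""
    timer = timeit.Timer(lambda: func(*args))
    # min() is the conventional summary: slower runs mostly reflect noise.
    return min(timer.repeat(repeat=repeat, number=number)) / number


def speedup(baseline, candidate, *args, **kwargs):
    """How many times faster candidate is than baseline on the same input."""
    return benchmark(baseline, *args, **kwargs) / benchmark(
        candidate, *args, **kwargs
    )


def slow_total(values):
    """Stand-in for a pure Python implementation."""
    total = 0
    for value in values:
        total += value
    return total


data = list(range(1000))
print(f"speedup: {speedup(slow_total, sum, data, number=200):.1f}x")
```

The measured ratio depends on input size, hardware, and compiler flags, so any reported figure should state the input it was measured on.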
Debugging
Print Debugging
GDB/LLDB
Memory Profiling
Common Pitfalls
1. Forgetting to Rebuild
After modifying `.pyx` files, always rebuild:
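A stale binary is easy to miss because Python keeps importing the old `.so` or `.pyd` without complaint. As a sketch, assuming extensions are built in place next to their sources, the hypothetical helper below lists `.pyx` files that have no compiled binary or are newer than it.

```python
from pathlib import Path

BINARY_SUFFIXES = (".so", ".pyd")


def stale_extensions(root):
    """Return .pyx files that are missing a binary or newer than it."""
    stale = []
    for source in sorted(Path(root).rglob("*.pyx")):
        # Built files are named like name.cpython-312-x86_64-linux-gnu.so,
        # so match on the stem prefix rather than the exact filename.
        binaries = [
            path
            for path in source.parent.glob(source.stem + ".*")
            if path.suffix in BINARY_SUFFIXES
        ]
        newest = max((b.stat().st_mtime for b in binaries), default=None)
        if newest is None or source.stat().st_mtime > newest:
            stale.append(source)
    return stale
```

Calling `stale_extensions(".")` from the repository root before running tests gives a quick signal that a rebuild is due. It compares modification times only, so it will not notice changes to `.pxd` files or build flags.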
2. GIL Management
3. Memory Management
4. Type Declarations
Performance Tips
- Use `cdef` for internal functions - Not callable from Python, but faster
- Use typed memoryviews - Faster than NumPy arrays in loops
- Minimize GIL acquisition - Use `nogil` where possible
- Use `cpdef` for hybrid - Callable from Python and fast from Cython
- Profile before optimizing - Use `cython -a` to see Python interactions
Annotation Output
Contributing
When contributing Cython code:
- Include both the `.pyx` source and a `.py` wrapper
- Add a `.pxd` file if cross-module imports are needed
- Write tests that work with both backends
- Document performance characteristics
- Test on multiple platforms if possible