mySpellChecker provides a NormalizationService that consolidates all text normalization logic into a single, consistent interface.
Overview
Text normalization ensures consistent text representation across all components. The NormalizationService provides purpose-specific methods for different use cases:
| Method | Use Case | Zawgyi Conversion | Zero-Width Removal |
|---|---|---|---|
| for_spell_checking() | Validation pipeline | No | Yes |
| for_dictionary_lookup() | Database queries | Yes | Yes |
| for_comparison() | Text comparison | Yes | Yes |
| for_display() | User output | No | No |
| for_ingestion() | Corpus building | Yes | Yes |
NormalizationService
Basic Usage
Spell Checking Normalization
Fast normalization for the validation pipeline (no Zawgyi conversion):
- Strip whitespace
- Unicode NFC normalization
- Remove zero-width characters
- Myanmar diacritic reordering
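The four steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the service's implementation: the function names are hypothetical, and the reordering step is deliberately simplified to sorting runs of medial signs (U+103B to U+103E) into code point order.

```python
import re
import unicodedata

# Zero-width space, non-joiner, and joiner mapped to None so translate() drops them.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D])

# Runs of two or more Myanmar medial signs (ya, ra, wa, ha: U+103B..U+103E).
MEDIAL_RUN = re.compile("[\u103b-\u103e]{2,}")


def reorder_medials(text: str) -> str:
    """Simplified stand-in for diacritic reordering (hypothetical helper):
    sort each run of medial signs into code point order (ya, ra, wa, ha)."""
    return MEDIAL_RUN.sub(lambda m: "".join(sorted(m.group())), text)


def normalize_for_spell_check(text: str) -> str:
    """Hypothetical sketch of the spell-checking pipeline listed above."""
    text = text.strip()                        # 1. strip whitespace
    text = unicodedata.normalize("NFC", text)  # 2. Unicode NFC normalization
    text = text.translate(ZERO_WIDTH)          # 3. remove zero-width characters
    return reorder_medials(text)               # 4. (simplified) diacritic reordering


# la + medial ha + ZWSP + medial ya + aa, with surrounding whitespace
messy = " \u101c\u103e\u200b\u103b\u102c\n"
clean = normalize_for_spell_check(messy)
```

Removing zero-width characters before reordering matters in this sketch: the stray U+200B would otherwise split the medial run and leave it unsorted.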
Dictionary Lookup Normalization
Complete normalization for database queries:
- Strip whitespace
- Zawgyi to Unicode conversion (if detected)
- Unicode NFC normalization
- Remove zero-width characters
- Myanmar diacritic reordering
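What distinguishes this path is the conditional conversion step. The sketch below shows one way to express "convert only if detected"; `is_zawgyi` and `zawgyi_to_unicode` are hypothetical hooks supplied by the caller, since a real detector and converter are outside the scope of a short example, and diacritic reordering is omitted.

```python
import unicodedata
from typing import Callable

ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D])


def normalize_for_lookup(
    text: str,
    is_zawgyi: Callable[[str], bool],
    zawgyi_to_unicode: Callable[[str], str],
) -> str:
    """Hypothetical lookup-style normalization with conditional conversion."""
    text = text.strip()
    if is_zawgyi(text):                        # convert only when detection fires
        text = zawgyi_to_unicode(text)
    text = unicodedata.normalize("NFC", text)  # Unicode-level cleanup comes after conversion
    return text.translate(ZERO_WIDTH)          # diacritic reordering omitted in this sketch


# Stub hooks for demonstration only; they do not perform real Zawgyi handling.
unchanged = normalize_for_lookup(" ab\u200b ", lambda t: False, str.upper)
converted = normalize_for_lookup(" ab\u200b ", lambda t: True, str.upper)
```

Passing the detector and converter as parameters keeps the ordering of the steps testable without depending on any particular conversion library.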
Comparison Normalization
Aggressive normalization for text comparison.
Display Normalization
Minimal normalization preserving user formatting:
- Unicode NFC normalization
- Myanmar diacritic reordering
- Preserves whitespace and zero-width characters
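To make the contrast with the other methods concrete, the following sketch applies only NFC composition and leaves whitespace and zero-width characters untouched. The function name is hypothetical and the diacritic reordering step is omitted.

```python
import unicodedata


def normalize_for_display_sketch(text: str) -> str:
    # Compose characters only; whitespace and zero-width characters are kept.
    return unicodedata.normalize("NFC", text)


raw = "  cafe\u0301\u200b "          # decomposed "é", a zero-width space, padding
shown = normalize_for_display_sketch(raw)
```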
Corpus Ingestion
Full normalization for building dictionaries.
NormalizationOptions
Customize normalization with options.
Presets
Pre-defined presets for common use cases.
Preset Configuration
| Preset | Unicode | Zero-Width | Diacritics | Zawgyi | Whitespace |
|---|---|---|---|---|---|
| SPELL_CHECK | NFC | Remove | Reorder | No | Strip |
| DICTIONARY_LOOKUP | NFC | Remove | Reorder | Convert | Strip |
| COMPARISON | NFC | Remove | Reorder | Convert | Strip |
| DISPLAY | NFC | Keep | Reorder | No | Keep |
| INGESTION | NFC | Remove | Reorder | Convert | Strip |
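The table can be read as a set of option records. The sketch below encodes it with a frozen dataclass; the class and field names are assumptions made for illustration and are not taken from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptionsSketch:
    """Hypothetical mirror of the five columns in the preset table."""
    unicode_form: str = "NFC"
    remove_zero_width: bool = True
    reorder_diacritics: bool = True
    convert_zawgyi: bool = False
    strip_whitespace: bool = True


PRESETS = {
    "SPELL_CHECK": OptionsSketch(),
    "DICTIONARY_LOOKUP": OptionsSketch(convert_zawgyi=True),
    "COMPARISON": OptionsSketch(convert_zawgyi=True),
    "DISPLAY": OptionsSketch(remove_zero_width=False, strip_whitespace=False),
    "INGESTION": OptionsSketch(convert_zawgyi=True),
}
```

Encoded this way, DICTIONARY_LOOKUP, COMPARISON, and INGESTION compare equal on these five columns; any further differences between them are not captured by the table.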
Myanmar Text Detection
Check if text is primarily Myanmar script.
Zawgyi Handling
For methods with Zawgyi conversion enabled, the service detects Zawgyi-encoded text and converts it to Unicode.
Convenience Functions
Module-level functions for quick access.
Cython Optimization
Core normalization functions are Cython-optimized.
Thread Safety
The NormalizationService is thread-safe.
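How the service achieves thread safety is not described here. One common approach, sketched below under that caveat, is a stateless service behind a lock-guarded shared accessor; `ServiceSketch` and `get_service` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ServiceSketch:
    """Stateless stand-in for a normalization service (hypothetical)."""

    def for_spell_checking(self, text: str) -> str:
        return text.strip()


_instance = None
_lock = threading.Lock()


def get_service() -> ServiceSketch:
    """Return one shared instance, created at most once."""
    global _instance
    if _instance is None:              # fast path without taking the lock
        with _lock:
            if _instance is None:      # re-check while holding the lock
                _instance = ServiceSketch()
    return _instance


# Request the service from many threads; all of them receive the same object.
with ThreadPoolExecutor(max_workers=8) as pool:
    services = list(pool.map(lambda _: get_service(), range(32)))
```

Because the stand-in holds no mutable state, concurrent calls to its methods need no further locking once the instance exists.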
Integration
The normalization service is used throughout mySpellChecker.
In SpellChecker
In Data Pipeline
In Providers
Normalization Steps
1. Unicode Normalization
Converts text to a consistent Unicode form (NFC by default).
2. Zero-Width Character Removal
Removes invisible characters that can cause matching issues:
- Zero-width space (U+200B)
- Zero-width non-joiner (U+200C)
- Zero-width joiner (U+200D)
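Steps 1 and 2 can be illustrated with the standard library. The example shows why an invisible character defeats an exact match and how NFC folds a decomposed sequence into its precomposed form; the variable names are illustrative only.

```python
import unicodedata

# The three characters listed above, mapped to None so translate() deletes them.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D])

typed = "\u1019\u103c\u200b\u1014\u103a"   # "မြန်" with a stray zero-width space
stored = "\u1019\u103c\u1014\u103a"        # the same syllable as a dictionary key

# The two strings render identically but do not compare equal.
assert typed != stored
assert typed.translate(ZERO_WIDTH) == stored

# NFC composes a base letter plus combining mark into one code point.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00e9"
```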
3. Myanmar Diacritic Reordering
Ensures consistent ordering of Myanmar diacritics.
4. Zawgyi Detection and Conversion
Detects legacy Zawgyi encoding and converts to Unicode.
Best Practices
- Use purpose-specific methods: Choose the right method for your use case
- Normalize at boundaries: Normalize input at system entry points
- Be consistent: Use the same normalization for related operations
- Handle Zawgyi: Enable Zawgyi conversion for user-facing input
- Cache results: The service uses a singleton pattern for efficiency
See Also
- Text Utilities - Other text processing utilities
- Configuration - Configuration options
- Data Pipeline - Corpus processing