mySpellChecker provides a unified NormalizationService that consolidates all text normalization logic into a single, consistent interface.

Overview

Text normalization ensures consistent text representation across all components. The NormalizationService provides purpose-specific methods for different use cases:
Method                   Use Case             Zawgyi Conversion   Zero-Width Removal
for_spell_checking()     Validation pipeline  No                  Yes
for_dictionary_lookup()  Database queries     Yes                 Yes
for_comparison()         Text comparison      Yes                 Yes
for_display()            User output          No                  No
for_ingestion()          Corpus building      Yes                 Yes

NormalizationService

Basic Usage

from myspellchecker.text.normalization_service import (
    NormalizationService,
    get_normalization_service
)

# Get singleton service
service = get_normalization_service()

# Or create new instance
service = NormalizationService()

Spell Checking Normalization

Fast normalization for the validation pipeline (no Zawgyi conversion):
normalized = service.for_spell_checking("  မြန်မာ  ")
print(normalized)  # "မြန်မာ"
Pipeline:
  1. Strip whitespace
  2. Unicode NFC normalization
  3. Remove zero-width characters
  4. Myanmar diacritic reordering
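
The first three steps above can be sketched in plain Python (a minimal approximation, not the library's implementation; the Myanmar-specific diacritic reordering of step 4 is omitted since it lives in the Cython helpers):

```python
import re
import unicodedata

# Zero-width space, non-joiner, and joiner
_ZERO_WIDTH = re.compile("[\u200b\u200c\u200d]")

def spell_check_normalize(text: str) -> str:
    """Approximate steps 1-3 of the spell-checking pipeline."""
    text = text.strip()                        # 1. strip whitespace
    text = unicodedata.normalize("NFC", text)  # 2. Unicode NFC normalization
    text = _ZERO_WIDTH.sub("", text)           # 3. remove zero-width characters
    # 4. Myanmar diacritic reordering is script-specific and omitted here.
    return text

print(spell_check_normalize("  မြန်မာ\u200b  "))  # မြန်မာ
```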

Dictionary Lookup Normalization

Complete normalization for database queries:
normalized = service.for_dictionary_lookup(user_input)
Pipeline:
  1. Strip whitespace
  2. Zawgyi to Unicode conversion (if detected)
  3. Unicode NFC normalization
  4. Remove zero-width characters
  5. Myanmar diacritic reordering

Comparison Normalization

Aggressive normalization for text comparison:
a = service.for_comparison(user_input)
b = service.for_comparison(dictionary_entry)
if a == b:
    print("Match!")

Display Normalization

Minimal normalization preserving user formatting:
normalized = service.for_display(text)
Pipeline:
  1. Unicode NFC normalization
  2. Myanmar diacritic reordering
  3. Preserves whitespace and zero-width characters

Corpus Ingestion

Full normalization for building dictionaries:
normalized = service.for_ingestion(corpus_line)

NormalizationOptions

Customize normalization with options:
from myspellchecker.text.normalization_service import (
    NormalizationService,
    NormalizationOptions
)

options = NormalizationOptions(
    unicode_form="NFC",       # NFC, NFD, NFKC, NFKD
    remove_zero_width=True,   # Remove zero-width characters
    reorder_diacritics=True,  # Myanmar-specific reordering
    convert_zawgyi=False,     # Detect and convert Zawgyi
    strip_whitespace=True,    # Strip leading/trailing whitespace
    lowercase=False           # Lowercase (for non-Myanmar text)
)

service = NormalizationService()
normalized = service.normalize(text, options)

Presets

Pre-defined presets for common use cases:
from myspellchecker.text.normalization_service import (
    PRESET_SPELL_CHECK,
    PRESET_DICTIONARY_LOOKUP,
    PRESET_COMPARISON,
    PRESET_DISPLAY,
    PRESET_INGESTION
)

service = NormalizationService()
normalized = service.normalize(text, PRESET_COMPARISON)

Preset Configuration

Preset             Unicode  Zero-Width  Diacritics  Zawgyi   Whitespace
SPELL_CHECK        NFC      Remove      Reorder     No       Strip
DICTIONARY_LOOKUP  NFC      Remove      Reorder     Convert  Strip
COMPARISON         NFC      Remove      Reorder     Convert  Strip
DISPLAY            NFC      Keep        Reorder     No       Keep
INGESTION          NFC      Remove      Reorder     Convert  Strip
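
The presets are presumably just pre-built NormalizationOptions instances. A rough sketch using a stand-in dataclass (field names mirror the options shown earlier; the mapping is inferred from the table, not from the source):

```python
from dataclasses import dataclass

# Stand-in for NormalizationOptions; not the library's actual class.
@dataclass(frozen=True)
class Options:
    unicode_form: str = "NFC"
    remove_zero_width: bool = True
    reorder_diacritics: bool = True
    convert_zawgyi: bool = False
    strip_whitespace: bool = True

# Plausible preset values, read off the table above:
SPELL_CHECK = Options()
COMPARISON = Options(convert_zawgyi=True)
DISPLAY = Options(remove_zero_width=False, strip_whitespace=False)
```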

Myanmar Text Detection

Check if text is primarily Myanmar script:
service = NormalizationService()

is_myanmar = service.is_myanmar_text("မြန်မာ")  # True
is_myanmar = service.is_myanmar_text("Hello")  # False
is_myanmar = service.is_myanmar_text("Hello မြန်မာ")  # Depends on threshold
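
The threshold-based detection likely boils down to the ratio of codepoints in the Myanmar Unicode block (U+1000-U+109F). A self-contained sketch (the 0.3 default mirrors myanmar_text_threshold from ZawgyiConfig; the exact logic is an assumption):

```python
def myanmar_ratio(text: str) -> float:
    """Fraction of non-space characters in the Myanmar block (U+1000-U+109F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    myanmar = sum(1 for c in chars if "\u1000" <= c <= "\u109f")
    return myanmar / len(chars)

def is_myanmar_text_sketch(text: str, threshold: float = 0.3) -> bool:
    return myanmar_ratio(text) >= threshold

print(is_myanmar_text_sketch("မြန်မာ"))  # True
print(is_myanmar_text_sketch("Hello"))  # False
```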

Zawgyi Handling

The service automatically detects and converts Zawgyi encoding:
from myspellchecker.core.config.text_configs import ZawgyiConfig

# Custom Zawgyi configuration
zawgyi_config = ZawgyiConfig(
    conversion_threshold=0.9,     # Probability threshold for conversion
    myanmar_text_threshold=0.3    # Min Myanmar character ratio
)

service = NormalizationService(zawgyi_config=zawgyi_config)

# Will convert Zawgyi if probability >= 0.9
normalized = service.for_dictionary_lookup(potentially_zawgyi_text)
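
The thresholding decision itself can be sketched with pluggable detector and converter callables (names are hypothetical; in practice the probability would come from a Zawgyi detector such as myanmar-tools):

```python
from typing import Callable

def maybe_convert_zawgyi(
    text: str,
    detect: Callable[[str], float],   # returns Zawgyi probability in [0, 1]
    convert: Callable[[str], str],    # Zawgyi-to-Unicode converter
    threshold: float = 0.9,           # mirrors conversion_threshold above
) -> str:
    """Convert only when the detector is sufficiently confident."""
    return convert(text) if detect(text) >= threshold else text

# Stub callables for illustration only:
print(maybe_convert_zawgyi("abc", detect=lambda t: 0.95, convert=str.upper))  # ABC
print(maybe_convert_zawgyi("abc", detect=lambda t: 0.50, convert=str.upper))  # abc
```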

Convenience Functions

Module-level functions for quick access:
from myspellchecker.text.normalization_service import (
    normalize_for_spell_checking,
    normalize_for_lookup,
    normalize_for_comparison
)

# These use the default singleton service
normalized = normalize_for_spell_checking(text)
normalized = normalize_for_lookup(text)
normalized = normalize_for_comparison(text)

Cython Optimization

Core normalization functions are Cython-optimized:
# These are used internally by NormalizationService
from myspellchecker.text.normalize_c import (
    remove_zero_width_chars,      # Fast zero-width removal
    reorder_myanmar_diacritics,   # Diacritic reordering
    get_myanmar_ratio             # Myanmar character ratio
)

Thread Safety

The NormalizationService is thread-safe:
from concurrent.futures import ThreadPoolExecutor

service = get_normalization_service()  # Thread-safe singleton

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(service.for_spell_checking, text)
        for text in texts
    ]
    results = [f.result() for f in futures]

Integration

The normalization service is used throughout mySpellChecker:

In SpellChecker

from myspellchecker import SpellChecker
from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(provider=provider)
# Internally uses NormalizationService for consistent normalization
result = checker.check(text)

In Data Pipeline

from myspellchecker.data_pipeline import Pipeline

pipeline = Pipeline()
# Uses for_ingestion() when processing corpus files
pipeline.ingest("corpus.txt")

In Providers

from myspellchecker.providers import SQLiteProvider

provider = SQLiteProvider()
# Uses normalized form before database queries
is_valid = provider.is_valid_word("မြန်မာ")

Normalization Steps

1. Unicode Normalization

Converts text to a consistent Unicode form (NFC by default):
import unicodedata

# Composed form (NFC)
text = unicodedata.normalize("NFC", text)
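
For illustration, NFC composes base-plus-combining sequences into single codepoints:

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by a combining acute accent (2 codepoints)
composed = unicodedata.normalize("NFC", decomposed)
print(composed == "\u00e9", len(composed))  # True 1
```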

2. Zero-Width Character Removal

Removes invisible characters that can cause matching issues:
  • Zero-width space (U+200B)
  • Zero-width non-joiner (U+200C)
  • Zero-width joiner (U+200D)
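
These three codepoints can be stripped efficiently with str.translate (one illustrative way to do it, not necessarily the library's Cython implementation):

```python
# Map each zero-width codepoint to None so translate() deletes it
ZERO_WIDTH_MAP = dict.fromkeys(map(ord, "\u200b\u200c\u200d"))

cleaned = "က\u200bခ".translate(ZERO_WIDTH_MAP)
print(len(cleaned))  # 2 (the invisible U+200B is gone)
```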

3. Myanmar Diacritic Reordering

Ensures consistent ordering of Myanmar diacritics:
# Example: the syllable ကော ("kaw")
# Before: U+1031 U+1000 U+102C  (ေ stored before က — non-canonical)
# After:  U+1000 U+1031 U+102C  (က before ေ — canonical Unicode order)
# Both render visually as: ကော
# The ေ vowel always appears to the left visually, regardless of codepoint order.
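
A simplified sketch of this reordering for the single case above: a regex that moves a pre-posed ေ (U+1031) after the consonant that follows it (the real reorder_myanmar_diacritics is more general; this handles only a lone consonant in the basic range):

```python
import re

def reorder_e_vowel(text: str) -> str:
    """Move ေ (U+1031) to after the single consonant (U+1000-U+1021) it precedes."""
    return re.sub("\u1031([\u1000-\u1021])", "\\1\u1031", text)

noncanonical = "\u1031\u1000\u102c"  # ေ before က (non-canonical order)
print(reorder_e_vowel(noncanonical) == "\u1000\u1031\u102c")  # True
```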

4. Zawgyi Detection and Conversion

Detects legacy Zawgyi encoding and converts to Unicode:
# Requires myanmar-tools package
# pip install myanmar-tools

Best Practices

  1. Use purpose-specific methods: Choose the right method for your use case
  2. Normalize at boundaries: Normalize input at system entry points
  3. Be consistent: Use the same normalization for related operations
  4. Handle Zawgyi: Enable Zawgyi conversion for user-facing input
  5. Reuse the service: the singleton from get_normalization_service() avoids repeated initialization

See Also