Phonetic Matching

The PhoneticHasher enables fuzzy matching based on pronunciation to catch spelling errors where words sound the same but are spelled differently.

Phonetic Hasher

mySpellChecker includes a custom PhoneticHasher optimized for Myanmar phonology. It converts a string into a phonetic code, so that words can be compared by how they sound rather than by how they are spelled.

Key Features

Consonant Grouping: Maps similar-sounding consonants (e.g., က vs ဂ) to the same code.
Tone Normalization: Can ignore tone marks (e.g., ကာ, ကား, က) to find tonal errors.
Vowel Normalization: Treats short and long vowels (e.g., ိ vs ီ) as identical.
Adaptive Length Encoding: Automatically extends the code length for compound words to preserve phonetic information.
Nasal Normalization: Optionally unifies various nasal endings (e.g., န်, မ်, င်) to the Anusvara (ံ) sound. Note: normalize_nasals defaults to False, so you must enable it explicitly if you need it.

Python API

from myspellchecker.text.phonetic import PhoneticHasher

hasher = PhoneticHasher()

# 1. Encoding
code1 = hasher.encode("မြန်")  # -> 'p-medial_r-vowel_a-n'
code2 = hasher.encode("မျန်")  # -> 'p-medial_y-vowel_a-n' (Note difference)

# 2. Similarity Check
is_similar = hasher.similar(code1, code2, max_distance=1)
# True (Small edit distance in phonetic space)

# 3. Finding Variants
variants = hasher.get_phonetic_variants("မြန်")
# {'မြန်', 'မျန်', 'ဗြန်', ...}

# 4. Tonal Variants (for Real-Word Error detection)
tonal_vars = hasher.get_tonal_variants("ကား")
# {'ကား', 'ကာ', 'က'}

How It Works

Encoding Process

1. Normalization: Text is converted to Unicode NFC form.
2. Preprocessing: If normalize_nasals=True, common nasal endings (န်, မ်, င်) are normalized to ံ. (Disabled by default.)
3. Mapping: Each character is mapped to a phonetic group code (e.g., KA_GROUP, MEDIAL_R).
4. Filtering: Tone marks and viramas are optionally stripped.
5. Concatenation: Codes are joined to form the final hash.
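
The five steps above can be illustrated with a minimal, self-contained sketch. It is not the PhoneticHasher implementation: the function name sketch_encode, the strip_marks flag, and the tiny GROUP_CODES table are all hypothetical, so the codes it produces will not match the library's real output. It only shows the order of operations.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping table (illustration only).
GROUP_CODES = {
    "\u1000": "k",         # က
    "\u1002": "k",         # ဂ, grouped with က
    "\u1019": "m",         # မ
    "\u1014": "n",         # န
    "\u103B": "medial_y",  # ျ
    "\u103C": "medial_r",  # ြ
    "\u102C": "vowel_a",   # ာ
    "\u102D": "vowel_i",   # ိ (short)
    "\u102E": "vowel_i",   # ီ (long), same code as the short vowel
    "\u1036": "nasal",     # ံ (anusvara)
    "\u1037": "tone",      # ့ (dot below)
    "\u1038": "tone",      # း (visarga)
    "\u1039": "virama",    # ္ (stacking virama)
    "\u103A": "virama",    # ် (asat)
}
NASAL_ENDINGS = ("\u1014\u103A", "\u1019\u103A", "\u1004\u103A")  # န်, မ်, င်
ANUSVARA = "\u1036"  # ံ


def sketch_encode(text, normalize_nasals=False, strip_marks=True):
    """Toy phonetic encoder following the five documented steps."""
    # 1. Normalization: convert to Unicode NFC.
    text = unicodedata.normalize("NFC", text)
    # 2. Preprocessing: optionally collapse nasal endings to the anusvara.
    if normalize_nasals:
        for ending in NASAL_ENDINGS:
            text = text.replace(ending, ANUSVARA)
    # 3. Mapping: one group code per character (unknown characters pass through).
    codes = [GROUP_CODES.get(ch, ch) for ch in text]
    # 4. Filtering: optionally drop tone marks and viramas.
    if strip_marks:
        codes = [c for c in codes if c not in ("tone", "virama")]
    # 5. Concatenation: join the codes into the final hash.
    return "-".join(codes)


print(sketch_encode("\u1000\u102C\u1038"))  # ကား -> k-vowel_a
print(sketch_encode("\u1002\u102C"))        # ဂာ  -> k-vowel_a
```

In this sketch, ကား and ဂာ collapse to the same code because the consonants share a group and the tone mark is filtered out, which is the behaviour the Key Features list describes.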

Scoring

The find_phonetically_similar method delegates to compute_phonetic_similarity, which uses a multi-factor scoring approach:

Character-level similarity: Compares characters pairwise using Myanmar substitution costs (MYANMAR_SUBSTITUTION_COSTS), visual confusability, and phonetic group membership.
Length penalty: Proportional penalty for length differences: (max_len - min_len) / max_len * 0.2.
Phonetic code blending: Levenshtein distance on phonetic codes is blended with character-level similarity, with code weight scaled by input length (min(0.4, len / 20.0)).

$$\text{Score} = (1 - w) \times \text{CharSimilarity} + w \times \text{CodeSimilarity} - \text{LengthPenalty}$$

Where CharSimilarity is the average per-character similarity using substitution costs, CodeSimilarity = 1 - Levenshtein(Code_A, Code_B) / MaxLen, and w = min(0.4, len(input) / 20.0).
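
To make the arithmetic concrete, here is a small sketch of the blend. It is not the body of compute_phonetic_similarity. It makes three assumptions: MaxLen is the length of the longer phonetic code, the length penalty is measured on the two words, and CharSimilarity is supplied by the caller (the MYANMAR_SUBSTITUTION_COSTS table is not reproduced here). All function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def length_penalty(a, b):
    """(max_len - min_len) / max_len * 0.2, or 0 for two empty strings."""
    max_len = max(len(a), len(b))
    if max_len == 0:
        return 0.0
    return (max_len - min(len(a), len(b))) / max_len * 0.2


def code_similarity(code_a, code_b):
    """1 - Levenshtein / MaxLen (assumption: MaxLen is the longer code)."""
    max_len = max(len(code_a), len(code_b))
    if max_len == 0:
        return 1.0
    return 1.0 - levenshtein(code_a, code_b) / max_len


def blended_score(word, candidate, code_a, code_b, char_similarity):
    """Blend character-level and code-level similarity, then subtract the penalty."""
    w = min(0.4, len(word) / 20.0)  # code weight grows with input length, capped at 0.4
    return ((1 - w) * char_similarity
            + w * code_similarity(code_a, code_b)
            - length_penalty(word, candidate))
```

For a 4-character input the code weight is 4 / 20 = 0.2, so character-level similarity dominates. Inputs of 8 or more characters reach the 0.4 cap.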

Usage in Spell Checking

Note: Phonetic hashes are computed at runtime by the PhoneticHasher during lookup. They are not stored in the database, and there is no phonetic_hash column in any table.

1. Lookup: When a word is unknown (OOV), the system computes its phonetic hash at runtime.
2. Comparison: The hash is compared against hashes computed for dictionary candidates (taken from the SymSpell delete index).
3. Suggestion: Candidates are matched by:
   - Exact Hash Match: Words that sound identical.
   - Near Hash Match: Words that sound very similar (e.g., slight medial difference).
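
The exact/near split can be sketched as follows. This is an illustration under stated assumptions, not the library's lookup code: match_candidates and token_distance are hypothetical names, the distance is a simple stand-in (positional token mismatches plus length difference) rather than a true edit distance, and the encoder is stubbed with a lookup table. Two of the stub codes are the ones quoted in the Python API example; the other two are assumed.

```python
def token_distance(code_a, code_b):
    """Stand-in distance: positional token mismatches plus length difference."""
    a, b = code_a.split("-"), code_b.split("-")
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def match_candidates(word, candidates, encode, max_distance=1):
    """Split candidates into exact and near phonetic matches for `word`."""
    target = encode(word)  # hash computed at runtime, never read from storage
    exact, near = [], []
    for cand in candidates:
        d = token_distance(target, encode(cand))
        if d == 0:
            exact.append(cand)   # sounds identical
        elif d <= max_distance:
            near.append(cand)    # sounds very similar
    return exact, near


# Stub encoder: a lookup table standing in for PhoneticHasher.encode.
STUB_CODES = {
    "\u1019\u103C\u1014\u103A": "p-medial_r-vowel_a-n",  # မြန် (code quoted on this page)
    "\u1019\u103B\u1014\u103A": "p-medial_y-vowel_a-n",  # မျန် (code quoted on this page)
    "\u1017\u103C\u1014\u103A": "p-medial_r-vowel_a-n",  # ဗြန် (assumed code)
    "\u1000\u102C\u1038": "k-vowel_a",                    # ကား (assumed code)
}

exact, near = match_candidates(
    "\u1017\u103C\u1014\u103A",  # unknown word: ဗြန်
    ["\u1019\u103C\u1014\u103A", "\u1019\u103B\u1014\u103A", "\u1000\u102C\u1038"],
    STUB_CODES.__getitem__,
)
print(exact)  # the candidate whose code is identical: မြန်
print(near)   # the candidate that differs only in the medial: မျန်
```

Under these assumptions ကား is discarded, because its code differs from the target in more than one token.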

Configuration

Phonetic matching is controlled by the use_phonetic option in SpellCheckerConfig.
