PhoneticHasher enables fuzzy matching based on pronunciation to catch spelling errors where words sound the same but are spelled differently.
Phonetic Hasher
mySpellChecker includes a custom PhoneticHasher optimized for Myanmar phonology. It converts a string into a phonetic code, allowing for fuzzy matching based on pronunciation.
Key Features
- Consonant Grouping: Maps similar-sounding consonants (e.g., က vs ဂ) to the same code.
- Tone Normalization: Can ignore tone marks (e.g., ကာ, ကား, က) to find tonal errors.
- Vowel Normalization: Treats short and long vowels (e.g., ိ vs ီ) as identical.
- Adaptive Length Encoding: Automatically extends the code length for compound words to preserve phonetic information.
- Nasal Normalization: Optionally unifies various nasal endings (e.g., န်, မ်, င်) to the Anusvara (ံ) sound. Note: normalize_nasals defaults to False, so you must explicitly enable it if needed.
Python API
How It Works
Encoding Process
- Normalization: Text is converted to Unicode NFC form.
- Preprocessing: If normalize_nasals=True, common nasal endings (န်, မ်, င်) are normalized to ံ. (Disabled by default.)
- Mapping: Each character is mapped to a phonetic group code (e.g., KA_GROUP, MEDIAL_R).
- Filtering: Tone marks and viramas are optionally stripped.
- Concatenation: Codes are joined to form the final hash.
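The steps above can be sketched as a small standalone function. This is a minimal illustration, not the library's implementation: the function name toy_phonetic_hash, the strip_marks parameter, the single-letter codes, and the tiny lookup tables are all assumptions made for the example. Only normalize_nasals and the order of the steps come from the description above.

```python
import unicodedata

# Hypothetical tables for illustration only; the library's real groupings
# and code names (such as KA_GROUP or MEDIAL_R) are not reproduced here.
NASAL_ENDINGS = ("\u1014\u103a", "\u1019\u103a", "\u1004\u103a")  # န်, မ်, င်
ANUSVARA = "\u1036"  # ံ
GROUP_CODES = {
    "\u1000": "K",  # က
    "\u1002": "K",  # ဂ shares a code with က (consonant grouping)
    "\u1019": "M",  # မ
    "\u102d": "I",  # ိ (short vowel)
    "\u102e": "I",  # ီ (long vowel) shares a code with ိ
    "\u102c": "A",  # ာ
    "\u1036": "N",  # ံ
}
TONE_MARKS = {"\u1037", "\u1038"}  # ့ and း
VIRAMAS = {"\u103a", "\u1039"}     # ် (asat) and ္ (virama)


def toy_phonetic_hash(text, normalize_nasals=False, strip_marks=True):
    """Toy encoder that follows the five documented steps."""
    # Normalization: convert to Unicode NFC.
    text = unicodedata.normalize("NFC", text)
    # Preprocessing: optionally fold nasal endings into the anusvara.
    if normalize_nasals:
        for ending in NASAL_ENDINGS:
            text = text.replace(ending, ANUSVARA)
    codes = []
    for ch in text:
        # Filtering: optionally drop tone marks and viramas.
        if strip_marks and (ch in TONE_MARKS or ch in VIRAMAS):
            continue
        # Mapping: characters missing from the toy table map to themselves.
        codes.append(GROUP_CODES.get(ch, ch))
    # Concatenation: join the codes into the final hash.
    return "".join(codes)


print(toy_phonetic_hash("\u1000\u102c\u1038"))  # ကား -> KA
print(toy_phonetic_hash("\u1002\u102c"))        # ဂာ -> KA
```

In this sketch ကား and ဂာ collapse to the same code, which is the property that lets a pronunciation-based lookup treat them as candidates for each other. With normalize_nasals left at its default of False, ကန် and ကမ် keep distinct codes.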
Scoring
The find_phonetically_similar method delegates to compute_phonetic_similarity, which uses a multi-factor scoring approach:
- Character-level similarity: Compares characters pairwise using Myanmar substitution costs (MYANMAR_SUBSTITUTION_COSTS), visual confusability, and phonetic group membership.
- Length penalty: Proportional penalty for length differences: (max_len - min_len) / max_len * 0.2.
- Phonetic code blending: Levenshtein distance on phonetic codes is blended with character-level similarity, with code weight scaled by input length (min(0.4, len / 20.0)).
Here, CharSimilarity is the average per-character similarity derived from the substitution costs, CodeSimilarity = 1 - Levenshtein(Code_A, Code_B) / MaxLen, and the code weight is w = min(0.4, len(input) / 20.0).
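To make the arithmetic concrete, the sketch below implements the three stated quantities (length penalty, CodeSimilarity, and w) and combines them. The combination itself is an assumption: the text says the two similarities are "blended" but does not spell out the formula, so this example uses a linear blend (1 - w) * CharSimilarity + w * CodeSimilarity and then subtracts the length penalty. The function names are hypothetical, MaxLen is taken to be the longer of the two codes, and CharSimilarity is passed in as a plain number because the contents of MYANMAR_SUBSTITUTION_COSTS are not shown here.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def code_similarity(code_a, code_b):
    """CodeSimilarity = 1 - Levenshtein(Code_A, Code_B) / MaxLen."""
    max_len = max(len(code_a), len(code_b))
    if max_len == 0:
        return 1.0  # two empty codes are treated as identical
    return 1.0 - levenshtein(code_a, code_b) / max_len


def length_penalty(text_a, text_b):
    """(max_len - min_len) / max_len * 0.2, as stated in the list above."""
    max_len = max(len(text_a), len(text_b))
    min_len = min(len(text_a), len(text_b))
    if max_len == 0:
        return 0.0
    return (max_len - min_len) / max_len * 0.2


def code_weight(input_text):
    """w = min(0.4, len(input) / 20.0)."""
    return min(0.4, len(input_text) / 20.0)


def blended_score(input_text, candidate, code_a, code_b, char_similarity):
    # ASSUMPTION: linear blend with the penalty subtracted and the result
    # clamped at zero. The real compute_phonetic_similarity may differ.
    w = code_weight(input_text)
    blended = (1.0 - w) * char_similarity + w * code_similarity(code_a, code_b)
    return max(0.0, blended - length_penalty(input_text, candidate))
```

Under these assumptions, a 4-character input gets w = 0.2, so the phonetic code contributes a fifth of the blended score. Inputs of 8 characters or more reach the 0.4 cap.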
Usage in Spell Checking
Note: Phonetic hashing is computed at runtime, not stored in the database schema. There is no phonetic_hash column in the database tables. Hashes are generated on the fly using the PhoneticHasher during lookup.
- Lookup: When a word is unknown (OOV), the system computes its phonetic hash at runtime.
- Comparison: The hash is compared against hashes computed for dictionary candidates (drawn from the SymSpell delete index).
- Suggestion: Candidates are matched by:
  - Exact Hash Match: Words that sound identical.
  - Near Hash Match: Words that sound very similar (e.g., slight medial difference).
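The lookup flow above can be illustrated with a short sketch. Everything named here is hypothetical: toy_hasher stands in for PhoneticHasher, classify_candidates is not a library function, and treating "near" as "at most one edit between hashes" is an assumption, since the actual threshold is not documented in this section. The candidate list would come from the SymSpell delete index in the real system and is hard-coded here.

```python
def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = j = 0
    edited = False
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
            continue
        if edited:
            return False  # a second mismatch means more than one edit
        edited = True
        if len(a) == len(b):
            i += 1  # substitution: advance both strings
        j += 1      # insertion in b: advance only b
    return True


def toy_hasher(word):
    # Stand-in for PhoneticHasher: folds ဂ into က and drops the tone mark း.
    return word.replace("\u1002", "\u1000").replace("\u1038", "")


def classify_candidates(oov_word, candidates, hasher):
    """Split candidates into exact and near hash matches for an OOV word."""
    oov_hash = hasher(oov_word)  # computed at lookup time
    exact, near = [], []
    for word in candidates:
        candidate_hash = hasher(word)  # also computed on the fly, never stored
        if candidate_hash == oov_hash:
            exact.append(word)
        elif within_one_edit(candidate_hash, oov_hash):
            near.append(word)
    return exact, near


candidates = ["\u1000\u102c", "\u1000\u102c\u1038", "\u1000\u103c\u102c", "\u1019\u102d"]
exact, near = classify_candidates("\u1002\u102c", candidates, toy_hasher)
print(exact)  # ကာ and ကား share the hash of the input ဂာ
print(near)   # ကြာ differs only by the medial ြ
```

A production implementation would likely cache hashes for repeated lookups. The sketch recomputes them each time to mirror the note that no hash is stored in the database.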
Configuration
Phonetic matching is controlled by the use_phonetic option in SpellCheckerConfig.