Layer 2 extends syllable validation to handle multi-syllable words, including dictionary lookup, compound validation, OOV recovery, and context-aware ranking.

Overview

Word validation extends syllable validation to handle multi-syllable words. It includes:
  • Dictionary lookup for complete words
  • Compound word validation (SymSpell)
  • Productive reduplication validation (ReduplicationEngine)
  • Compound word synthesis via DP segmentation (CompoundResolver)
  • OOV (Out-of-Vocabulary) recovery via morphological analysis
  • Context-aware suggestion ranking
  • Morpheme-level suggestion correction (MorphemeSuggestionStrategy)
  • Colloquial variant detection

Architecture

  +-------------------+
  | Input Text        |
  +---------+---------+
            |
            v
  +-------------------------------+
  | Layer 1: Syllable Validation  |
  |   See syllable-validation.md  |
  +---------------+---------------+
                  |
                  | syllables valid
                  v
  +-------------------------------------------+
  | Layer 2: Word Validation (THIS MODULE)    |
  |                                           |
  |   +---------------------+                 |
  |   | Dictionary lookup   |                 |
  |   +---------+-----------+                 |
  |             |                             |
  |             v                             |
  |   +---------------------+                 |
  |   | Compound validation |  (SymSpell)     |
  |   +---------+-----------+                 |
  |             |                             |
  |             v                             |
  |   +----------------------------+          |
  |   | Reduplication validation   |  (NEW)   |
  |   +---------+------------------+          |
  |             |                             |
  |             v                             |
  |   +----------------------------+          |
  |   | Compound synthesis (DP)    |  (NEW)   |
  |   +---------+------------------+          |
  |             |                             |
  |             v                             |
  |   +-----------------------------+         |
  |   | Context-aware suggestions   |         |
  |   | (incl. morpheme correction) |         |
  |   +-----------------------------+         |
  +-------------------------------------------+

WordValidator

Initialization

from myspellchecker.core.validators import WordValidator
from myspellchecker.core.config import SpellCheckerConfig

validator = WordValidator(
    config=SpellCheckerConfig(),
    segmenter=segmenter,
    word_repository=provider,
    syllable_repository=provider,
    symspell=symspell,
    context_checker=context_checker,      # Optional
    suggestion_strategy=strategy,         # Optional
    reduplication_engine=redup_engine,    # Optional (Phase 1)
    compound_resolver=compound_resolver,  # Optional (Phase 2/3)
)

Factory Method

from myspellchecker.core.validators import WordValidator

validator = WordValidator.create(
    word_repository=provider,
    syllable_repository=provider,
    segmenter=segmenter,
    symspell=symspell,
    config=config,
    context_checker=context_checker,
)

Basic Usage

# Validate text and get errors
errors = validator.validate("မြန်မာနိုင်ငံသည်")

for error in errors:
    print(f"Error: {error.text} at position {error.position}")
    print(f"Type: {error.error_type}")
    print(f"Suggestions: {error.suggestions[:3]}")
    print(f"Confidence: {error.confidence}")

Validation Process

Step 1: Word Segmentation

Text is segmented into words using the configured segmenter:
words = segmenter.segment_words(text)
# ["မြန်မာ", "နိုင်ငံ", "သည်"]

Step 2: Dictionary Lookup

Each word is checked against the word repository:
if word_repository.is_valid_word(word):
    # Valid word - check for colloquial variants
    pass
else:
    # Not found - continue to compound check
    pass

Step 3: Compound Validation

Words not found directly may be valid compounds:
# Check if word splits into valid parts with no edits
compound_check = symspell.lookup_compound(word, max_edit_distance=0)

if compound_check and compound_check[0][1] == 0:
    # Best split has edit distance 0 - valid compound word
    pass

Step 4: Reduplication Validation

Words not found in the dictionary or via compound check may be productive reduplications of known words:
# Check if word is a valid reduplication (e.g., ကောင်းကောင်း from ကောင်း)
if reduplication_engine.analyze(word, dict_check, freq_check, pos_check):
    # Valid reduplication - accept without error
    pass
Supported patterns:
  • AA: Simple repetition (ကောင်းကောင်း “well”)
  • AABB: Each syllable doubles (သေသေချာချာ “carefully”)
  • ABAB: Whole word repeats (ခဏခဏ “frequently”)
  • RHYME: Known rhyme pairs from grammar/patterns.py
Safeguards: base must be in dictionary, frequency >= 5, POS must be V/ADJ/ADV/N.
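
The shape checks behind these patterns can be sketched over a word's syllable list. This is a hypothetical illustration only (the RHYME table lookup and the dictionary/frequency/POS safeguards are omitted); the actual ReduplicationEngine API may differ:

```python
# Hypothetical sketch of the AA / AABB / ABAB shape checks; the real
# ReduplicationEngine also handles RHYME pairs and applies the
# dictionary, frequency, and POS safeguards described above.
def classify_reduplication(syllables):
    """Return "AA", "AABB", "ABAB", or None for a syllable sequence."""
    n = len(syllables)
    # AA: one syllable repeated (e.g. ကောင်းကောင်း)
    if n == 2 and syllables[0] == syllables[1]:
        return "AA"
    if n == 4:
        a, b, c, d = syllables
        # AABB: each syllable of a two-syllable base doubles (သေသေချာချာ)
        if a == b and c == d and a != c:
            return "AABB"
        # ABAB: the whole two-syllable word repeats (ခဏခဏ)
        if a == c and b == d and a != b:
            return "ABAB"
    return None

print(classify_reduplication(["ကောင်း", "ကောင်း"]))  # AA
```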

Step 5: Compound Synthesis

Words not matching any previous check may be valid compounds formed from known dictionary morphemes:
# Check if word splits into valid morphemes (e.g., ကျောင်းသား = ကျောင်း + သား)
if compound_resolver.resolve(word, dict_check, freq_check, pos_check):
    # Valid compound - accept without error
    pass
Uses dynamic programming for optimal segmentation.
  • Allowed patterns: N+N, V+V, N+V, V+N, ADJ+N
  • Blocked patterns: P+P, P+N, N+P
Safeguards: all parts in dictionary, frequency >= 10 per morpheme, max 4 parts.
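
As a rough sketch, the DP can be framed as finding the split with the fewest known-morpheme parts. This minimal version is an assumption about the objective and omits the POS-pattern and frequency checks that the real CompoundResolver applies:

```python
# Hypothetical sketch of DP segmentation into dictionary morphemes.
def segment_compound(word, is_morpheme, max_parts=4):
    """Return the split of `word` into the fewest known morphemes, or None."""
    n = len(word)
    best = [None] * (n + 1)  # best[i] = best split covering word[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            part = word[j:i]
            if best[j] is not None and is_morpheme(part):
                cand = best[j] + [part]
                if len(cand) <= max_parts and (
                    best[i] is None or len(cand) < len(best[i])
                ):
                    best[i] = cand
    return best[n]

# Toy Latin-script lexicon for illustration
lexicon = {"school", "boy", "house"}
print(segment_compound("schoolboy", lexicon.__contains__))  # ['school', 'boy']
```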

Step 6: OOV Recovery (Morphology)

For unknown words, attempt morphological analysis:
from myspellchecker.text.morphology import analyze_word

# Analyze word structure against the dictionary
analysis = analyze_word(word, dictionary_check=word_repository.is_valid_word)

if analysis.root and analysis.suffixes:
    # Recovered root: စား with suffixes: ['ခဲ့', 'သည်']
    # Generate suggestions from similar roots
    pass

Step 7: Suggestion Generation

Suggestions are generated via the unified strategy pipeline:
# Uses multiple sources:
# - SymSpell edit distance suggestions
# - Morphology-based suggestions
# - Compound splitting suggestions
# - Morpheme-level corrections (NEW - fixes typos inside compounds/reduplications)
# - Context ranking from N-grams

suggestions = strategy.suggest(word, context)

Step 8: Context Ranking

Suggestions are ranked using bidirectional context:
# Context from surrounding words
prev_word = words[i-1] if i > 0 else None
next_word = words[i+1] if i < len(words)-1 else None

# Context-aware suggestions via NgramContextChecker
context_suggestions = context_checker.suggest(
    prev_word=prev_word,
    current_word=word,
    next_word=next_word,
)
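
One simple way to realize the bidirectional ranking is to score each candidate by bigram evidence on both sides; `rank_by_context` and the raw-count scoring below are illustrative assumptions, not the NgramContextChecker implementation:

```python
# Hypothetical sketch of bidirectional bigram ranking.
def rank_by_context(candidates, prev_word, next_word, bigram_count):
    """Order candidates by combined left- and right-bigram counts."""
    def score(cand):
        left = bigram_count.get((prev_word, cand), 0) if prev_word else 0
        right = bigram_count.get((cand, next_word), 0) if next_word else 0
        return left + right
    return sorted(candidates, key=score, reverse=True)

# Toy counts: "come from home" is far more frequent than "come form"
counts = {("come", "from"): 12, ("from", "home"): 9, ("come", "form"): 1}
print(rank_by_context(["form", "from"], "come", "home", counts))  # ['from', 'form']
```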

OOV Recovery Details

Morphological Analysis

The morphology module decomposes unknown words:
from myspellchecker.text.morphology import analyze_word, WordAnalysis

# Word: စားခဲ့သည် (ate - formal)
analysis = analyze_word("စားခဲ့သည်", dictionary_check=lambda w: w in dict)

# WordAnalysis:
#   original: "စားခဲ့သည်"
#   root: "စား" (eat)
#   suffixes: ["ခဲ့", "သည်"] (past tense, formal ending)

Enhanced Suggestions

OOV recovery improves suggestion quality:
# Original word has typo in root
word = "စာခဲ့သည်"  # typo: စား → စာ

# Without morphology: generic edit-distance suggestions
# With morphology:
#   1. Finds root "စာ" (close to "စား")
#   2. Gets suggestions for root: ["စား", "စာ"]
#   3. Reconstructs: "စားခဲ့သည်", "စာခဲ့သည်"
#   4. Better suggestions for the full inflected form
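
The reconstruction step can be sketched as follows; `suggest_inflected`, `split_root`, and `root_suggestions` are hypothetical stand-ins for the morphology analyzer and the SymSpell root lookup:

```python
# Hypothetical sketch: correct the root of an inflected word, then
# reattach its original suffix chain to each candidate root.
def suggest_inflected(word, split_root, root_suggestions, max_results=5):
    root, suffixes = split_root(word)
    if root is None:
        return []  # no morphological analysis available
    tail = "".join(suffixes)
    return [cand + tail for cand in root_suggestions(root)][:max_results]

# Toy Latin-script example: misspelled root "quickk" + suffix "ly"
split = lambda w: (w[:-2], ["ly"]) if w.endswith("ly") else (None, [])
roots = lambda r: ["quick", "quiet"] if r == "quickk" else []
print(suggest_inflected("quickkly", split, roots))  # ['quickly', 'quietly']
```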

Colloquial Variant Detection

Configuration

from myspellchecker.core.config import SpellCheckerConfig, ValidationConfig

config = SpellCheckerConfig(
    validation=ValidationConfig(
        colloquial_strictness="lenient"  # or "strict", "off"
    )
)

Behavior by Mode

  Mode     Behavior
  strict   Flag colloquial as error, suggest standard form
  lenient  Info note with low confidence, not counted as error
  off      No special handling
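
The per-mode handling above can be sketched as a small dispatch; `handle_colloquial` is an illustrative helper, not the validator's actual API:

```python
# Hypothetical sketch of the three colloquial_strictness modes.
def handle_colloquial(word, standard_forms, mode):
    """Return (is_error, error_type, suggestions) for a colloquial word."""
    if mode == "strict":
        return (True, "colloquial_variant", sorted(standard_forms))
    if mode == "lenient":
        return (False, "colloquial_info", sorted(standard_forms))
    return (False, None, [])  # "off": no special handling

print(handle_colloquial("gonna", {"going to"}, "lenient"))
# (False, 'colloquial_info', ['going to'])
```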

Examples

from myspellchecker.text.phonetic_data import is_colloquial_variant, get_standard_forms

# Common colloquial forms
is_colloquial_variant("ကျနော်")  # True (colloquial for ကျွန်တော်)
get_standard_forms("ကျနော်")     # {"ကျွန်တော်"}

is_colloquial_variant("ပြော")    # True (colloquial for ပြောသည်)

Interface Segregation

WordValidator uses narrow repository interfaces:

WordRepository Interface

class WordRepository(Protocol):
    def is_valid_word(self, word: str) -> bool: ...
    def get_word_frequency(self, word: str) -> int: ...
    def get_all_words(self) -> Iterator[Tuple[str, int]]: ...

SyllableRepository Interface

class SyllableRepository(Protocol):
    def is_valid_syllable(self, syllable: str) -> bool: ...
    def get_syllable_frequency(self, syllable: str) -> int: ...
This design:
  • Reduces coupling to full DictionaryProvider
  • Makes testing easier with minimal mocks
  • Allows different storage backends
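
To illustrate, a small dict-backed class satisfies WordRepository structurally (it only needs the Protocol's members), so a test can avoid the full DictionaryProvider; `DictWordRepository` is an illustrative name, not part of the package:

```python
from typing import Dict, Iterator, Tuple

# Hypothetical minimal implementation of the WordRepository protocol.
class DictWordRepository:
    def __init__(self, words: Dict[str, int]):
        self._words = words

    def is_valid_word(self, word: str) -> bool:
        return word in self._words

    def get_word_frequency(self, word: str) -> int:
        return self._words.get(word, 0)

    def get_all_words(self) -> Iterator[Tuple[str, int]]:
        return iter(self._words.items())

repo = DictWordRepository({"hello": 10})
print(repo.is_valid_word("hello"), repo.get_word_frequency("bye"))  # True 0
```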

Error Types

WordValidator returns WordError objects:
from myspellchecker.core.response import WordError

class WordError(Error):
    text: str           # The invalid word
    position: int       # Character position in text
    suggestions: List[str]  # Correction suggestions
    confidence: float   # Error confidence (0.0-1.0)
    error_type: str     # "invalid_word", "colloquial_variant", etc.
    syllable_count: int # Number of syllables in word

Error Type Values

  Type                Description
  invalid_word        Unknown word, not in dictionary
  colloquial_variant  Colloquial spelling (strict mode)
  colloquial_info     Colloquial spelling (lenient mode)

Configuration Options

Via SpellCheckerConfig

from myspellchecker.core.config import SpellCheckerConfig

config = SpellCheckerConfig(
    # Suggestion settings
    max_suggestions=5,
    max_edit_distance=2,

    # Feature flags
    use_phonetic=True,        # Phonetic similarity
    use_context_checker=True, # N-gram context ranking

    # Validation settings
    validation=ValidationConfig(
        word_error_confidence=0.85,
        colloquial_strictness="lenient",
        allow_extended_myanmar=False,

        # Morphological synthesis (NEW)
        use_reduplication_validation=True,       # Accept valid reduplications
        reduplication_min_base_frequency=5,      # Min frequency for base word
        reduplication_cache_size=1024,            # Cache size

        use_compound_synthesis=True,             # Accept valid compounds
        compound_min_morpheme_frequency=10,      # Min frequency per morpheme
        compound_max_parts=4,                    # Max parts in compound
        compound_cache_size=1024,                # Cache size
    )
)

Usage with SpellChecker

Validation Level

from myspellchecker import SpellChecker
from myspellchecker.core.constants import ValidationLevel

checker = SpellChecker()

# Word-level validation (includes syllable + word)
result = checker.check(text, level=ValidationLevel.WORD)

# Syllable-only validation (faster, less coverage)
result = checker.check(text, level=ValidationLevel.SYLLABLE)

Performance Comparison

  Level     Speed   Coverage
  SYLLABLE  ~10ms   ~90% of errors
  WORD      ~50ms   ~95% of errors

Suggestion Strategy

Composite Strategy Pipeline

from myspellchecker.algorithms.suggestion_strategy import (
    CompositeSuggestionStrategy,
    SymSpellSuggestionStrategy,
    MorphologySuggestionStrategy,
    CompoundSuggestionStrategy,
)
from myspellchecker.algorithms.morpheme_suggestion_strategy import (
    MorphemeSuggestionStrategy,
)

strategy = CompositeSuggestionStrategy([
    SymSpellSuggestionStrategy(symspell),           # Edit-distance suggestions
    MorphologySuggestionStrategy(symspell, ...),    # OOV recovery
    CompoundSuggestionStrategy(symspell, ...),      # Word splitting/joining
    MorphemeSuggestionStrategy(                     # Morpheme-level correction (NEW)
        compound_resolver=resolver,
        reduplication_engine=engine,
        symspell=symspell,
        dictionary_check=dict_check,
    ),
])

# Unified suggestion with source attribution
result = strategy.suggest(word, context)
# Returns: SuggestionResult with ranked suggestions and sources

Testing

Unit Test Example

from myspellchecker.core.validators import WordValidator
from myspellchecker.providers import MemoryProvider

# Create minimal test provider
provider = MemoryProvider(
    syllables={"မြန်": 100, "မာ": 50},
    words={"မြန်မာ": 80}
)

validator = WordValidator.create(
    word_repository=provider,
    syllable_repository=provider,
    segmenter=MockSegmenter(),
    symspell=MockSymSpell(),
)

# Test validation
errors = validator.validate("မြန်မာ")
assert len(errors) == 0  # Valid word

See Also