Layer 2 extends syllable validation to handle multi-syllable words, including dictionary lookup, compound validation, OOV recovery, and context-aware ranking.

Overview

Word validation extends syllable validation to handle multi-syllable words. It includes:
  • Dictionary lookup for complete words
  • Compound word validation (SymSpell)
  • Productive reduplication validation (ReduplicationEngine)
  • Compound word synthesis via DP segmentation (CompoundResolver)
  • OOV (Out-of-Vocabulary) recovery via morphological analysis
  • Context-aware suggestion ranking
  • Morpheme-level suggestion correction (MorphemeSuggestionStrategy)
  • Colloquial variant detection

Architecture

  +-------------------+
  | Input Text        |
  +---------+---------+
            |
            v
  +-------------------------------+
  | Layer 1: Syllable Validation  |
  |   See syllable-validation.md  |
  +---------------+---------------+
                  |
                  | syllables valid
                  v
  +-------------------------------------------+
  | Layer 2: Word Validation (THIS MODULE)    |
  |                                           |
  |   +---------------------+                 |
  |   | Dictionary lookup   |                 |
  |   +---------+-----------+                 |
  |             |                             |
  |             v                             |
  |   +---------------------+                 |
  |   | Compound validation |  (SymSpell)     |
  |   +---------+-----------+                 |
  |             |                             |
  |             v                             |
  |   +----------------------------+          |
  |   | Reduplication validation   |  (NEW)   |
  |   +---------+------------------+          |
  |             |                             |
  |             v                             |
  |   +----------------------------+          |
  |   | Compound synthesis (DP)    |  (NEW)   |
  |   +---------+------------------+          |
  |             |                             |
  |             v                             |
  |   +-----------------------------+         |
  |   | Context-aware suggestions   |         |
  |   | (incl. morpheme correction) |         |
  |   +-----------------------------+         |
  +-------------------------------------------+

WordValidator

Initialization

from myspellchecker.core.validators import WordValidator
from myspellchecker.core.config import SpellCheckerConfig

validator = WordValidator(
    config=SpellCheckerConfig(),
    segmenter=segmenter,
    word_repository=provider,
    syllable_repository=provider,
    symspell=symspell,
    context_checker=context_checker,      # Optional
    suggestion_strategy=strategy,         # Optional
    reduplication_engine=redup_engine,    # Optional (Phase 1)
    compound_resolver=compound_resolver,  # Optional (Phase 2/3)
)

Factory Method

from myspellchecker.core.validators import WordValidator

validator = WordValidator.create(
    word_repository=provider,
    syllable_repository=provider,
    segmenter=segmenter,
    symspell=symspell,
    config=config,
    context_checker=context_checker,
)

Basic Usage

# Validate text and get errors
errors = validator.validate("မြန်မာနိုင်ငံသည်")

for error in errors:
    print(f"Error: {error.text} at position {error.position}")
    print(f"Type: {error.error_type}")
    print(f"Suggestions: {error.suggestions[:3]}")
    print(f"Confidence: {error.confidence}")

Validation Process

Step 1: Word Segmentation

Text is segmented into words using the configured segmenter:
words = segmenter.segment_words(text)
# ["မြန်မာ", "နိုင်ငံ", "သည်"]

Step 2: Dictionary Lookup

Each word is checked against the word repository:
if word_repository.is_valid_word(word):
    # Valid word - check for colloquial variants
    pass
else:
    # Not found - continue to compound check
    pass

Step 3: Compound Validation

Words not found directly may be valid compounds:
# Check if word splits into valid parts with no edits
compound_check = symspell.lookup_compound(word, max_edit_distance=0)

if compound_check and compound_check[0][1] == 0:
    # Best split has edit distance 0 - valid compound word
    pass

Step 4: Reduplication Validation

Words not found in the dictionary or via compound check may be productive reduplications of known words:
# Check if word is a valid reduplication (e.g., ကောင်းကောင်း from ကောင်း)
if reduplication_engine.analyze(word, dict_check, freq_check, pos_check):
    # Valid reduplication - accept without error
    pass
Supported patterns:
  • AA: Simple repetition (ကောင်းကောင်း “well”)
  • AABB: Each syllable doubles (သေသေချာချာ “carefully”)
  • ABAB: Whole word repeats (ခဏခဏ “frequently”)
  • RHYME: Known rhyme pairs from grammar/patterns.py
Safeguards: base must be in dictionary, frequency >= 5, POS must be V/ADJ/ADV/N.
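
The shape checks behind these patterns can be sketched over a word's syllable list. This is a hypothetical illustration only (the RHYME table lookup and the dictionary/frequency/POS safeguards are omitted); the actual ReduplicationEngine API may differ:

```python
# Hypothetical sketch of the AA / AABB / ABAB shape checks; the real
# ReduplicationEngine also handles RHYME pairs and applies the
# dictionary, frequency, and POS safeguards described above.
def classify_reduplication(syllables):
    """Return "AA", "AABB", "ABAB", or None for a syllable sequence."""
    n = len(syllables)
    # AA: one syllable repeated (e.g. ကောင်းကောင်း)
    if n == 2 and syllables[0] == syllables[1]:
        return "AA"
    if n == 4:
        a, b, c, d = syllables
        # AABB: each syllable of a two-syllable base doubles (သေသေချာချာ)
        if a == b and c == d and a != c:
            return "AABB"
        # ABAB: the whole two-syllable word repeats (ခဏခဏ)
        if a == c and b == d and a != b:
            return "ABAB"
    return None

print(classify_reduplication(["ကောင်း", "ကောင်း"]))  # AA
```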

Step 5: Compound Synthesis

Words not matching any previous check may be valid compounds formed from known dictionary morphemes:
# Check if word splits into valid morphemes (e.g., ကျောင်းသား = ကျောင်း + သား)
if compound_resolver.resolve(word, dict_check, freq_check, pos_check):
    # Valid compound - accept without error
    pass
Uses dynamic programming for optimal segmentation.
  • Allowed patterns: N+N, V+V, N+V, V+N, ADJ+N
  • Blocked patterns: P+P, P+N, N+P
Safeguards: all parts in dictionary, frequency >= 10 per morpheme, max 4 parts.
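
As a rough sketch, the DP can be framed as finding the split with the fewest known-morpheme parts. This minimal version is an assumption about the objective and omits the POS-pattern and frequency checks that the real CompoundResolver applies:

```python
# Hypothetical sketch of DP segmentation into dictionary morphemes.
def segment_compound(word, is_morpheme, max_parts=4):
    """Return the split of `word` into the fewest known morphemes, or None."""
    n = len(word)
    best = [None] * (n + 1)  # best[i] = best split covering word[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            part = word[j:i]
            if best[j] is not None and is_morpheme(part):
                cand = best[j] + [part]
                if len(cand) <= max_parts and (
                    best[i] is None or len(cand) < len(best[i])
                ):
                    best[i] = cand
    return best[n]

# Toy Latin-script lexicon for illustration
lexicon = {"school", "boy", "house"}
print(segment_compound("schoolboy", lexicon.__contains__))  # ['school', 'boy']
```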

Step 6: OOV Recovery (Morphology)

For unknown words, attempt morphological analysis:
from myspellchecker.text.morphology import analyze_word

# Analyze word structure against the dictionary
analysis = analyze_word(word, dictionary_check=word_repository.is_valid_word)

if analysis.root and analysis.suffixes:
    # Recovered root: စား with suffixes: ['ခဲ့', 'သည်']
    # Generate suggestions from similar roots
    pass

Step 7: Suggestion Generation

Suggestions are generated via the unified strategy pipeline:
# Uses multiple sources:
# - SymSpell edit distance suggestions
# - Morphology-based suggestions
# - Compound splitting suggestions
# - Morpheme-level corrections (NEW - fixes typos inside compounds/reduplications)
# - Context ranking from N-grams

suggestions = strategy.suggest(word, context)

Step 8: Context Ranking

Suggestions are ranked using bidirectional context:
# Context from surrounding words
prev_word = words[i-1] if i > 0 else None
next_word = words[i+1] if i < len(words)-1 else None

# Context-aware suggestions via NgramContextChecker
context_suggestions = context_checker.suggest(
    prev_word=prev_word,
    current_word=word,
    next_word=next_word,
)
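
One simple way to realize the bidirectional ranking is to score each candidate by bigram evidence on both sides; `rank_by_context` and the raw-count scoring below are illustrative assumptions, not the NgramContextChecker implementation:

```python
# Hypothetical sketch of bidirectional bigram ranking.
def rank_by_context(candidates, prev_word, next_word, bigram_count):
    """Order candidates by combined left- and right-bigram counts."""
    def score(cand):
        left = bigram_count.get((prev_word, cand), 0) if prev_word else 0
        right = bigram_count.get((cand, next_word), 0) if next_word else 0
        return left + right
    return sorted(candidates, key=score, reverse=True)

# Toy counts: "come from home" is far more frequent than "come form"
counts = {("come", "from"): 12, ("from", "home"): 9, ("come", "form"): 1}
print(rank_by_context(["form", "from"], "come", "home", counts))  # ['from', 'form']
```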

OOV Recovery Details

Morphological Analysis

The morphology module decomposes unknown words:
from myspellchecker.text.morphology import analyze_word, WordAnalysis

# Word: စားခဲ့သည် (ate - formal)
analysis = analyze_word("စားခဲ့သည်", dictionary_check=lambda w: w in dict)

# WordAnalysis:
#   original: "စားခဲ့သည်"
#   root: "စား" (eat)
#   suffixes: ["ခဲ့", "သည်"] (past tense, formal ending)

Enhanced Suggestions

OOV recovery improves suggestion quality:
# Original word has typo in root
word = "စာခဲ့သည်"  # typo: စား → စာ

# Without morphology: generic edit-distance suggestions
# With morphology:
#   1. Finds root "စာ" (close to "စား")
#   2. Gets suggestions for root: ["စား", "စာ"]
#   3. Reconstructs: "စားခဲ့သည်", "စာခဲ့သည်"
#   4. Better suggestions for the full inflected form
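
The reconstruction step can be sketched as follows; `suggest_inflected`, `split_root`, and `root_suggestions` are hypothetical stand-ins for the morphology analyzer and the SymSpell root lookup:

```python
# Hypothetical sketch: correct the root of an inflected word, then
# reattach its original suffix chain to each candidate root.
def suggest_inflected(word, split_root, root_suggestions, max_results=5):
    root, suffixes = split_root(word)
    if root is None:
        return []  # no morphological analysis available
    tail = "".join(suffixes)
    return [cand + tail for cand in root_suggestions(root)][:max_results]

# Toy Latin-script example: misspelled root "quickk" + suffix "ly"
split = lambda w: (w[:-2], ["ly"]) if w.endswith("ly") else (None, [])
roots = lambda r: ["quick", "quiet"] if r == "quickk" else []
print(suggest_inflected("quickkly", split, roots))  # ['quickly', 'quietly']
```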

Colloquial Variant Detection

Configuration

from myspellchecker.core.config import SpellCheckerConfig, ValidationConfig

config = SpellCheckerConfig(
    validation=ValidationConfig(
        colloquial_strictness="lenient"  # or "strict", "off"
    )
)

Behavior by Mode

  Mode     Behavior
  strict   Flag colloquial as error, suggest standard form
  lenient  Info note with low confidence, not counted as error
  off      No special handling
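
The per-mode handling above can be sketched as a small dispatch; `handle_colloquial` is an illustrative helper, not the validator's actual API:

```python
# Hypothetical sketch of the three colloquial_strictness modes.
def handle_colloquial(word, standard_forms, mode):
    """Return (is_error, error_type, suggestions) for a colloquial word."""
    if mode == "strict":
        return (True, "colloquial_variant", sorted(standard_forms))
    if mode == "lenient":
        return (False, "colloquial_info", sorted(standard_forms))
    return (False, None, [])  # "off": no special handling

print(handle_colloquial("gonna", {"going to"}, "lenient"))
# (False, 'colloquial_info', ['going to'])
```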

Examples

from myspellchecker.text.phonetic_data import is_colloquial_variant, get_standard_forms

# Common colloquial forms
is_colloquial_variant("ကျနော်")  # True (colloquial for ကျွန်တော်)
get_standard_forms("ကျနော်")     # {"ကျွန်တော်"}

is_colloquial_variant("ပြော")    # True (colloquial for ပြောသည်)

Interface Segregation

WordValidator uses narrow repository interfaces:

WordRepository Interface

class WordRepository(Protocol):
    def is_valid_word(self, word: str) -> bool: ...
    def get_word_frequency(self, word: str) -> int: ...
    def get_all_words(self) -> Iterator[Tuple[str, int]]: ...

SyllableRepository Interface

class SyllableRepository(Protocol):
    def is_valid_syllable(self, syllable: str) -> bool: ...
    def get_syllable_frequency(self, syllable: str) -> int: ...
This design:
  • Reduces coupling to full DictionaryProvider
  • Makes testing easier with minimal mocks
  • Allows different storage backends
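
To illustrate, a small dict-backed class satisfies WordRepository structurally (it only needs the Protocol's members), so a test can avoid the full DictionaryProvider; `DictWordRepository` is an illustrative name, not part of the package:

```python
from typing import Dict, Iterator, Tuple

# Hypothetical minimal implementation of the WordRepository protocol.
class DictWordRepository:
    def __init__(self, words: Dict[str, int]):
        self._words = words

    def is_valid_word(self, word: str) -> bool:
        return word in self._words

    def get_word_frequency(self, word: str) -> int:
        return self._words.get(word, 0)

    def get_all_words(self) -> Iterator[Tuple[str, int]]:
        return iter(self._words.items())

repo = DictWordRepository({"hello": 10})
print(repo.is_valid_word("hello"), repo.get_word_frequency("bye"))  # True 0
```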

Error Types

WordValidator returns WordError objects:
from myspellchecker.core.response import WordError

class WordError(Error):
    text: str           # The invalid word
    position: int       # Character position in text
    suggestions: List[str]  # Correction suggestions
    confidence: float   # Error confidence (0.0-1.0)
    error_type: str     # "invalid_word", "colloquial_variant", etc.
    syllable_count: int # Number of syllables in word

Error Type Values

  Type                Description
  invalid_word        Unknown word, not in dictionary
  colloquial_variant  Colloquial spelling (strict mode)
  colloquial_info     Colloquial spelling (lenient mode)

Configuration Options

Via SpellCheckerConfig

from myspellchecker.core.config import SpellCheckerConfig

config = SpellCheckerConfig(
    # Suggestion settings
    max_suggestions=5,
    max_edit_distance=2,

    # Feature flags
    use_phonetic=True,        # Phonetic similarity
    use_context_checker=True, # N-gram context ranking

    # Validation settings
    validation=ValidationConfig(
        word_error_confidence=0.85,
        colloquial_strictness="lenient",
        allow_extended_myanmar=False,

        # Morphological synthesis (NEW)
        use_reduplication_validation=True,       # Accept valid reduplications
        reduplication_min_base_frequency=5,      # Min frequency for base word
        reduplication_cache_size=1024,            # Cache size

        use_compound_synthesis=True,             # Accept valid compounds
        compound_min_morpheme_frequency=10,      # Min frequency per morpheme
        compound_max_parts=4,                    # Max parts in compound
        compound_cache_size=1024,                # Cache size
    )
)

Usage with SpellChecker

Validation Level

from myspellchecker import SpellChecker
from myspellchecker.core.constants import ValidationLevel

checker = SpellChecker()

# Word-level validation (includes syllable + word)
result = checker.check(text, level=ValidationLevel.WORD)

# Syllable-only validation (faster, less coverage)
result = checker.check(text, level=ValidationLevel.SYLLABLE)

Performance Comparison

  Level     Speed   Coverage
  SYLLABLE  ~10ms   ~90% of errors
  WORD      ~50ms   ~95% of errors

Suggestion Strategy

Composite Strategy Pipeline

from myspellchecker.algorithms.suggestion_strategy import (
    CompositeSuggestionStrategy,
    SymSpellSuggestionStrategy,
    MorphologySuggestionStrategy,
    CompoundSuggestionStrategy,
)
from myspellchecker.algorithms.morpheme_suggestion_strategy import (
    MorphemeSuggestionStrategy,
)

strategy = CompositeSuggestionStrategy([
    SymSpellSuggestionStrategy(symspell),           # Edit-distance suggestions
    MorphologySuggestionStrategy(symspell, ...),    # OOV recovery
    CompoundSuggestionStrategy(symspell, ...),      # Word splitting/joining
    MorphemeSuggestionStrategy(                     # Morpheme-level correction (NEW)
        compound_resolver=resolver,
        reduplication_engine=engine,
        symspell=symspell,
        dictionary_check=dict_check,
    ),
])

# Unified suggestion with source attribution
result = strategy.suggest(word, context)
# Returns: SuggestionResult with ranked suggestions and sources

Testing

Unit Test Example

from myspellchecker.core.validators import WordValidator
from myspellchecker.providers import MemoryProvider

# Create minimal test provider
provider = MemoryProvider(
    syllables={"မြန်": 100, "မာ": 50},
    words={"မြန်မာ": 80}
)

validator = WordValidator.create(
    word_repository=provider,
    syllable_repository=provider,
    segmenter=MockSegmenter(),
    symspell=MockSymSpell(),
)

# Test validation
errors = validator.validate("မြန်မာ")
assert len(errors) == 0  # Valid word

See Also